
How to handle imbalanced datasets?

By Subha · 6 min read · Jan 25, 2024

https://medium.com/@pingsubhak/how-to-handle-imbalanced-datasets-4d2b6e23c717
Imbalanced data is a common problem in machine learning, where one class has significantly more observations than the other. This can lead to biased models and poor performance on the minority class.

What is an imbalanced dataset?


Let us assume there is a bank that issues credit cards to its customers. The bank suspects that some fraudulent transactions are taking place and decides to audit them. Out of 10,000 transactions, the bank found 80 that were fraudulent, which is just 0.8%. Here the class 'No fraud' is the majority class whereas 'Fraud' is the minority class.

Imbalanced data can occur in various domains and industries. Some common scenarios where imbalanced datasets may be encountered are:

1. Fraud detection: The number of fraudulent transactions is often much smaller than the number of legitimate transactions.


2. Cancer diagnosis: A small number of patients are diagnosed with cancer compared to the number of people who get tested.

3. Anomaly detection: Anomaly detection tasks, where the goal is to


identify unusual patterns or outliers, often involve imbalanced datasets
since anomalies are expected to be rare.

4. Internet applications (e-commerce): A million people might visit the website, but only 100 end up placing an order.

There are quite a few ways to handle imbalanced data in machine learning classification problems. In this article, we are going to get into the details of the following six techniques that are commonly used to handle imbalanced data in classification.

1. Random under-sampling

2. Random over-sampling

3. Synthetic over-sampling: SMOTE

4. Choose an appropriate evaluation metric

5. Choose the algorithm wisely

6. Cost-sensitive learning

1. Resampling — Under-sampling


Under-sampling

Under-sampling is the process of randomly deleting observations from the majority class so that its count matches the minority class. This technique is very simple and reduces model complexity, runtime and storage requirements, but it comes with its own disadvantages. First, we are actually throwing away data, and the discarded observations may contain useful information. Second, the reduced training set might not be a correct representation of the population, so the model might not generalize well on the test dataset.
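As a rough sketch (not from the original article), random under-sampling needs nothing beyond the standard library. The helper name `random_undersample` and the toy dataset below are mine, chosen to mirror the bank example with 80 fraud cases in 10,000 transactions:

```python
import random
from collections import defaultdict

def random_undersample(rows, label_of, seed=0):
    """Randomly drop majority-class rows so every class ends up with as
    many rows as the rarest class. `label_of` extracts a row's label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n_min))  # keep n_min per class
    rng.shuffle(balanced)
    return balanced

# Toy dataset: 9,920 'no_fraud' rows and 80 'fraud' rows, as in the example.
data = [("x", "no_fraud")] * 9920 + [("x", "fraud")] * 80
balanced = random_undersample(data, label_of=lambda r: r[1])
```

After resampling, the training set shrinks to 160 rows (80 per class), which illustrates the information-loss trade-off described above.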

2. Resampling — Over-sampling

Over-sampling is very similar to under-sampling, except that instead of reducing the samples of the majority class we increase the samples of the minority class to match the majority class. This technique is more suitable when there is little training data and we cannot afford under-sampling.

In this technique, we increase the instances of the minority class by randomly replicating the samples already present. Because there is no information loss, this technique often outperforms under-sampling in practice.

Over-sampling

The only catch is that replicating the minority class creates many duplicate data points, which can increase the chances of overfitting.
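A minimal sketch of random over-sampling, mirroring the under-sampling helper above (the function name and toy data are mine, not from the article):

```python
import random
from collections import defaultdict

def random_oversample(rows, label_of, seed=0):
    """Randomly replicate minority-class rows (sampling with replacement)
    until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    n_max = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)                        # keep all originals
        extra = n_max - len(members)
        balanced.extend(rng.choices(members, k=extra))  # add duplicates
    rng.shuffle(balanced)
    return balanced

data = [("x", "no_fraud")] * 9920 + [("x", "fraud")] * 80
balanced = random_oversample(data, label_of=lambda r: r[1])
```

Here the 80 fraud rows are replicated until both classes have 9,920 rows; the duplicates are exactly what can lead to overfitting.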


3. Synthetic Minority Over-sampling — SMOTE

Synthetic Minority Over-sampling Technique, abbreviated as SMOTE, is a clever way to over-sample the minority class while avoiding the overfitting that random over-sampling suffers from. In SMOTE, a subset of the minority class is taken and new synthetic data points are generated from it. These synthetic data points are then added to the original training dataset as additional examples of the minority class.

The SMOTE algorithm works as follows:

1. Select a sample, let's call it O (for Origin), from the minority class at random

2. Find the k nearest neighbors of O that belong to the same class

3. Connect O to each of these neighbors with a straight line

4. For each connection, select a scaling factor 'z' in the range [0, 1] at random

5. Place a new point on the line, (z*100)% of the way from O to the neighbor. These are our synthetic samples.

6. Repeat this process until you have the desired number of synthetic samples

SMOTE overcomes the overfitting problem of random over-sampling because no data points are replicated, and unlike under-sampling, no data is lost. It has a few limitations of its own, however: while creating synthetic examples of the minority class, it can increase the class overlap and introduce additional noise into the training dataset.
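The steps above can be sketched in plain Python. This is a simplified illustration, not the reference SMOTE implementation (for real work, a library such as imbalanced-learn is the usual choice); the helper name and the toy minority points are mine:

```python
import math
import random

def smote(minority, k=3, n_new=100, seed=0):
    """Simplified SMOTE sketch: generate n_new synthetic points from
    minority-class samples, each given as a tuple of floats."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        o = rng.choice(minority)                # step 1: pick an origin O
        # step 2: k nearest neighbours of O within the minority class
        neighbours = sorted((p for p in minority if p is not o),
                            key=lambda p: math.dist(o, p))[:k]
        nb = rng.choice(neighbours)             # step 3: pick one connection
        z = rng.random()                        # step 4: scaling factor in [0, 1]
        # step 5: new point z of the way along the line from O to nb
        synthetic.append(tuple(a + z * (b - a) for a, b in zip(o, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, k=2, n_new=5)
```

Because every synthetic point lies on a line segment between two real minority samples, it always falls inside the region the minority class already occupies.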

4. Choose an appropriate evaluation metric:

The accuracy of a classifier is the number of correctly classified observations divided by the total number of observations. This metric is good enough for well-balanced classes but performs poorly on imbalanced datasets, because even a 'dumb' model that classifies every test data point as the majority class will still score highly. For example, if my test set has 300 data points, 270 negative and 30 positive, and my dumb model classifies every test data point as negative, my accuracy is going to be 90%.

Hence accuracy is not a good measure for imbalanced datasets. In such cases we need other performance metrics, such as the F1 score, which are more appropriate for imbalanced data.

Precision is the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value and is a measure of a classifier's exactness; low precision indicates a high number of false positives. Recall is the number of true positives divided by the number of positive values in the test data. Recall is also called Sensitivity or the True Positive Rate and is a measure of a classifier's completeness; low recall indicates a high number of false negatives.


The F1 score is the harmonic mean of precision and recall, so it keeps the balance between them: it is high only when both precision and recall are high.
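These definitions are easy to verify on the 'dumb' model example above. The helper below is a hand-rolled sketch (the function name is mine); its formulas follow the standard definitions of precision, recall and F1:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class, computed from
    parallel lists of true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# The 'dumb' model from the text: 270 negatives, 30 positives, predict all 0.
y_true = [0] * 270 + [1] * 30
y_pred = [0] * 300
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

The dumb model's accuracy is 90%, yet its precision, recall and F1 for the positive class are all 0, which is exactly why these metrics expose what accuracy hides.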

5. Choosing the algorithm wisely:

While in every machine learning problem it is a good rule of thumb to try a variety of algorithms, it is especially beneficial with imbalanced datasets: there is no one size fits all.


Generally, decision-tree-based algorithms perform well on imbalanced datasets, and in modern machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform single decision trees. Tree-based algorithms work by learning a hierarchy of if/else questions, which can force both classes to be addressed.

6. Cost-sensitive learning:



Cost-sensitive learning is an approach in machine learning where the algorithm is trained with a modified cost function that assigns different costs to different types of errors. The goal is to make the model more sensitive to the minority class and reduce the impact of misclassifying its instances.

In a typical binary classification scenario there are four possible outcomes: true positive, true negative, false positive and false negative. Cost-sensitive learning assigns different costs to these outcomes, and the costs can be adjusted based on the specific requirements of the problem. The modified cost function is then used during the training phase to guide the model towards minimizing the total cost.

The two common strategies used in cost-sensitive learning:

1. Adjusting misclassification costs: Assign higher costs to misclassifying instances from the minority class (e.g., false negatives). This encourages the model to be more cautious and attentive to the minority class.

2. Class weights: Many machine learning algorithms allow you to assign weights to different classes. By giving a higher weight to the minority class, the algorithm places more emphasis on correctly classifying instances from that class.
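Both strategies can be sketched in a few lines. The 10:1 cost ratio below is an illustrative assumption, not a recommendation, and the weight formula follows the common 'balanced' convention (n_samples / (n_classes * class_count), as used by libraries such as scikit-learn):

```python
def total_cost(y_true, y_pred, cost_fn=10.0, cost_fp=1.0):
    """Total misclassification cost with asymmetric penalties: here a
    missed fraud (false negative) is assumed to cost 10x a false alarm."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return cost_fn * fn + cost_fp * fp

def balanced_class_weights(labels):
    """'Balanced' class weights, n_samples / (n_classes * class_count),
    so rarer classes get proportionally larger weights."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return {y: n / (len(counts) * c) for y, c in counts.items()}

labels = [0] * 9920 + [1] * 80   # the 0.8% fraud example from the text
weights = balanced_class_weights(labels)
# weights[1] is far larger than weights[0], so errors on the minority
# class dominate a weighted loss during training.
```

With this weighting, each fraud example counts roughly 124 times as much as a non-fraud example, which is precisely the 'more emphasis on the minority class' described above.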

Conclusion:

In this blog we have looked at six different ways of handling imbalanced datasets. There is no one size that fits all when working with imbalanced data: each approach has its own advantages and disadvantages, and depending on the problem we are solving we can choose the appropriate one.
