Classification
From Binary to Multioutput Systems
Gauranga Kumar Baishya
August 29, 2025
Outline
1 Introduction to Classification & MNIST
2 Training a Binary Classifier
3 Performance Measures
4 Multiclass Classification
5 Error Analysis
6 Advanced Classification Tasks
The MNIST Dataset: “Hello World” of ML
What is MNIST?
A dataset of 70,000 small, grayscale images of handwritten digits (0-9).
It’s a benchmark for testing new classification algorithms.
Dataset Structure
70,000 instances (images).
784 features per instance.
Each image is 28x28 pixels.
Each feature represents one pixel's intensity (0-255).
Figure: A few sample digits from the MNIST dataset.
Important First Step
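A common first step is to set the test set aside before exploring further. A minimal loading sketch (the slides do not show this code; fetch_openml and the conventional 60,000/10,000 split are assumptions):

from sklearn.datasets import fetch_openml

# Download MNIST from OpenML: 70,000 images, 784 pixel features each
mnist = fetch_openml('mnist_784', as_frame=False)
X = mnist.data
y = mnist.target.astype(int)  # labels arrive as strings; cast so (y == 5) works

# Conventional split: first 60,000 for training, last 10,000 for testing
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]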
Creating a “5-Detector”
Simplifying the Problem
To start, we’ll build a binary classifier that can only distinguish between
two classes: “5” and “not-5”.
Target Vector Creation
We create new target labels that are boolean: True for all 5s, False for all
other digits:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
Training an SGD Classifier
A good starting point is the Stochastic Gradient Descent (SGD)
classifier. It’s efficient and handles large datasets well.
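A minimal training sketch (assuming X_train and y_train_5 from the previous steps; random_state is set only for reproducibility):

from sklearn.linear_model import SGDClassifier

# Fit a linear model with stochastic gradient descent on the binary targets
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

# Predict whether a single image is a 5 (a 1-instance batch)
sgd_clf.predict(X_train[:1])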
The Problem with Accuracy
Initial Accuracy Score
Using 3-fold cross-validation, the SGDClassifier achieves over 93%
accuracy.
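A sketch of how these scores are obtained (sgd_clf, X_train, and y_train_5 as defined on the previous slides):

from sklearn.model_selection import cross_val_score

# Accuracy on each of the 3 cross-validation folds
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")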
array([0.96355, 0.93795, 0.95615])
This seems great, but is it?
The Pitfall of Skewed Datasets
Let’s consider a classifier that always predicts “not-5”.
Only about 10% of the images are 5s.
So, this “dumb” classifier will be correct about 90% of the time!
This shows that accuracy is not a good performance measure for
classifiers, especially on skewed datasets.
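For illustration, such a baseline can be built with scikit-learn's DummyClassifier (a sketch; the slides do not specify the exact baseline):

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Always predicts the most frequent class, i.e. "not-5"
never_5_clf = DummyClassifier(strategy="most_frequent")
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# ~0.90 accuracy on every fold, despite never detecting a single 5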
The Confusion Matrix
A Better Way to Evaluate
The confusion matrix provides a much better view of a classifier’s
performance by showing the number of times instances of class A are
classified as class B.
Terminology:
True Negatives (TN): Correctly classified as not-5.
False Positives (FP): Incorrectly classified as 5.
False Negatives (FN): Incorrectly classified as not-5.
True Positives (TP): Correctly classified as 5.
Figure: Structure of a confusion matrix.
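A sketch of computing the matrix from out-of-fold predictions (cross_val_predict ensures each instance is predicted by a model that never saw it during training):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Clean, out-of-sample predictions for every training instance
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

# Rows = actual classes (not-5, 5); columns = predicted classes
confusion_matrix(y_train_5, y_train_pred)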
Our 5-Detector’s Matrix
array([[53057,  1522],
       [ 1325,  4096]])
1522 non-5s were wrongly classified as 5s (FP).
1325 5s were wrongly classified as not-5s (FN).
Precision, Recall, and F1 Score
Precision: Accuracy of Positive Predictions
What proportion of positive identifications was actually correct?
\[ \text{Precision} = \frac{TP}{TP + FP} \]
For our model, precision is 4096/(4096 + 1522) ≈ 72.9%.
Recall (Sensitivity): True Positive Rate
What proportion of actual positives was identified correctly?
\[ \text{Recall} = \frac{TP}{TP + FN} \]
For our model, recall is 4096/(4096 + 1325) ≈ 75.6%.
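A sketch of computing both metrics (y_train_pred as obtained with cross_val_predict earlier):

from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)  # ≈ 0.729
recall_score(y_train_5, y_train_pred)     # ≈ 0.756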
Precision, Recall and F1 Score
F1 Score: The Harmonic Mean
A single metric that combines precision and recall. It gives more weight to
low values, so a high F1 score requires both high precision and high recall.
\[ F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
For our model, the F1 score is ≈ 0.742 (74.2%).
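A sketch of computing it directly (same predictions as before):

from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)  # ≈ 0.742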
The Precision-Recall Tradeoff
Figure: Plotting precision and recall against the decision threshold.
The Precision-Recall Trade-off
The Inherent Conflict
Unfortunately, increasing precision reduces recall, and vice versa; this is called the precision-recall trade-off.
How it Works: The Decision Threshold
Classifiers compute a score for each instance. If the score is above a
threshold, it’s classified as positive.
Raising the threshold: Increases precision (fewer false positives) but
decreases recall (more false negatives).
Lowering the threshold: Increases recall (fewer false negatives) but
decreases precision (more false positives).
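A sketch of working with the threshold explicitly (the threshold value below is purely illustrative):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

# Decision scores instead of hard predictions
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

# Precision and recall for every possible threshold
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

# Raise the threshold to trade recall for precision (value is illustrative)
y_pred_high_precision = (y_scores >= 3000)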
The ROC Curve
Receiver Operating Characteristic (ROC)
Another common tool for binary classifiers. It plots the True Positive
Rate (Recall) against the False Positive Rate (FPR).
FPR: The ratio of negative instances that are incorrectly classified as
positive.
A good classifier stays as far away from the diagonal line as possible
(toward the top-left corner).
Facts!
A perfect classifier has a ROC AUC (area under the ROC curve) of 1.
A purely random classifier has an AUC of 0.5.
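A sketch of computing the curve and its AUC (y_scores as obtained from decision_function earlier):

from sklearn.metrics import roc_curve, roc_auc_score

# False positive rate and true positive rate (recall) at every threshold
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

# 1.0 for a perfect classifier, 0.5 for random guessing
roc_auc_score(y_train_5, y_scores)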
Receiver Operating Characteristic (ROC) – Rule of Thumb
When to use ROC vs. Precision-Recall?
Since the ROC curve is so similar to the precision/recall (PR) curve, one
may wonder how to decide which one to use. As a rule of thumb, it is
preferable to use the PR curve whenever the positive class is rare or when
you care more about the false positives than the false negatives; otherwise,
the ROC curve is suitable. For example, when looking at the ROC curve
and the ROC AUC score for the digit classifier, one might think that the
classifier is very good. However, this is mostly because there are few
positives (5s) compared to the negatives (non-5s). In contrast, the PR
curve makes it clear that the classifier has room for improvement, as the
curve could be closer to the top-right corner.
Handling More Than Two Classes
Multiclass (or Multinomial) Classifiers
These classifiers can distinguish between more than two classes. Some
algorithms (like SGD, Random Forests) support this natively. Others (like
SVMs) are strictly binary.
One-vs-the-Rest (OvR): Train 1 binary classifier for each class (e.g., a 0-detector, a 1-detector, etc.). To classify a new image, get the decision score from each classifier and pick the class with the highest score.
One-vs-One (OvO): Train 1 binary classifier for every pair of classes (0 vs 1, 0 vs 2, 1 vs 2, etc.). For N classes, this requires N(N-1)/2 classifiers. The class that wins the most "duels" is chosen.
Scikit-Learn automatically applies OvR or OvO based on the algorithm.
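As a sketch, a strictly binary learner such as an SVM can also be wrapped in either strategy explicitly (the small training slice just keeps the example fast):

from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# One binary SVM per class (10 classifiers for 10 digits)
ovr_clf = OneVsRestClassifier(SVC())
ovr_clf.fit(X_train[:2000], y_train[:2000])

# One binary SVM per pair of classes (10*9/2 = 45 classifiers)
ovo_clf = OneVsOneClassifier(SVC())
ovo_clf.fit(X_train[:2000], y_train[:2000])

len(ovr_clf.estimators_), len(ovo_clf.estimators_)  # (10, 45)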
Improving Models by Analyzing Errors
The Multiclass Confusion Matrix
Just like with binary classification, we can create a confusion matrix to see
where the model is making mistakes.
Figure: Multiclass confusion matrix (rows are actual classes, columns are predicted classes).
The column for class 8 is bright, meaning many other digits are misclassified as 8s.
Pairs such as 3s and 5s, or 7s/4s and 9s, are often confused.
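A sketch of producing such a plot with scikit-learn's ConfusionMatrixDisplay (the multiclass predictions are assumed to come from cross_val_predict on the full 10-class targets):

import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import ConfusionMatrixDisplay

# Out-of-fold predictions for the 10-class problem
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)

# Rows = actual classes, columns = predicted classes
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred)
plt.show()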
Multilabel and Multioutput Classification
Multilabel Classification
A system that can output multiple binary classes for each instance.
Example: A face-recognition system that recognizes several people (say Alice, Bob, and Charlie) in one photo. If it sees Alice and Charlie, the output would be [1, 0, 1].
Evaluation can be done by calculating the F1 score for each label and
averaging the result.
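A minimal multilabel sketch (the two labels, "large digit" and "odd digit", are illustrative rather than from the slides; KNeighborsClassifier handles multilabel targets natively):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two binary labels per digit: is it >= 7? is it odd?
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict(X_train[:1])  # e.g. array([[False,  True]])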
Multioutput Classification
A generalization of multilabel classification where each label can be
multiclass (i.e., have more than two possible values).
Example: A system that removes noise from an image. The input is
a noisy image, and the output is a clean image.
Here, the output is multilabel (one label per pixel) and each label is
multiclass (pixel intensity from 0 to 255).
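A sketch of the noise-removal example (the noise level and the KNN choice are illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)

# Inputs: training images corrupted with random pixel noise
noise = rng.integers(0, 100, (len(X_train), 784))
X_train_noisy = X_train + noise
# Targets: the original clean images (one 0-255 label per pixel)
y_train_clean = X_train

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_noisy, y_train_clean)
clean_digit = knn_clf.predict(X_train_noisy[:1])  # a denoised 784-pixel image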
Questions?
Thank You!