Evaluation Metrics: Anand Avati

This document discusses evaluation metrics for binary classification models. It introduces the confusion matrix and the point metrics computed from it: accuracy, precision, recall, and F1 score. Thresholding is used to convert model scores into binary predictions before these metrics are calculated. Summary metrics that do not require choosing a threshold, such as AUC-ROC and AUC-PRC, are also covered, and threshold scanning is demonstrated as a way to understand the precision-recall trade-off. Choosing the right metric depends on the problem, and addressing class imbalance is important. The document provides an overview of the key evaluation concepts for binary classification.

Uploaded by

shivaharsh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
130 views31 pages

Evaluation Metrics: Anand Avati

This document discusses evaluation metrics for binary classification models. It introduces the confusion matrix and various point metrics like accuracy, precision, recall, F1 score that are calculated from the matrix. Thresholding is used to convert model scores to binary predictions for metrics calculation. Summary metrics like AUC-ROC and AUC-PRC that do not require thresholding are also covered. Threshold scanning is demonstrated as a way to understand the precision-recall tradeoff. Choosing the right metrics depends on the problem and addressing class imbalance is important. The document provides an overview of important evaluation concepts for binary classification.

Uploaded by

shivaharsh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Evaluation Metrics

CS229
Anand Avati
Topics
● Why?
● Binary classifiers
○ Rank view, Thresholding
● Metrics
○ Confusion Matrix
○ Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity, F-score
○ Summary metrics: AU-ROC, AU-PRC, Log-loss.
● Choosing Metrics
● Class Imbalance
○ Failure scenarios for each metric
● Multi-class
Why are metrics important?
- The training objective (cost function) is only a proxy for the real-world objective.
- Metrics help capture a business goal as a quantitative target (not all errors are equal).
- They help organize ML team effort towards that target.
  - Generally in the form of improving that metric on the dev set.
- Useful to quantify the "gap" between:
  - Desired performance and baseline (to estimate effort initially).
  - Desired performance and current performance.
  - Measure progress over time (No Free Lunch Theorem).
- Useful for lower-level tasks and debugging (like diagnosing bias vs. variance).
- Ideally the training objective would be the metric itself, but that is not always possible. Still, metrics are useful and important for evaluation.
Binary Classification
● X is Input
● Y is binary Output (0/1)
● Model is ŷ = h(X)
● Two types of models
○ Models that output a categorical class directly (K Nearest neighbor, Decision tree)
○ Models that output a real valued score (SVM, Logistic Regression)
■ Score could be margin (SVM), probability (LR, NN)
■ Need to pick a threshold
■ We focus on this type (the other type can be interpreted as a special case); see the thresholding sketch below
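A minimal sketch of that thresholding step, assuming scikit-learn's LogisticRegression and a made-up dataset (none of this is from the slides):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical data: 200 examples, 2 features, binary labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

    model = LogisticRegression().fit(X, y)
    scores = model.predict_proba(X)[:, 1]      # real-valued score in [0, 1]

    threshold = 0.5
    y_hat = (scores >= threshold).astype(int)  # thresholding turns scores into 0/1 predictions

Every point metric below is a function of the labels and these thresholded predictions (and hence of the threshold); the summary metrics consume the scores directly.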
Score based models
[Figure: test examples ranked by model score, from Score = 1 at the top down to Score = 0 at the bottom, with positive-labelled and negative-labelled examples interleaved.]

Prevalence = #positives / (#positives + #negatives)
Score based models: Classifier
[Figure: the same ranked examples with a threshold Th = 0.5. Examples scoring above the threshold are predicted positive; examples scoring below it are predicted negative.]
Point metrics: Confusion Matrix
At threshold Th = 0.5:

                     Label Positive   Label Negative
    Predict Positive        9                2
    Predict Negative        1                8
Point metrics: True Positives
TP = examples predicted positive and labelled positive. At Th = 0.5: TP = 9.
Point metrics: True Negatives
TN = examples predicted negative and labelled negative. At Th = 0.5: TP = 9, TN = 8.
Point metrics: False Positives
FP = examples predicted positive but labelled negative. At Th = 0.5: TP = 9, TN = 8, FP = 2.
Point metrics: False Negatives
FN = examples predicted negative but labelled positive. At Th = 0.5: TP = 9, TN = 8, FP = 2, FN = 1.
FP and FN are also called Type-1 and Type-2 errors.
Point metrics: Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN). At Th = 0.5: (9 + 8) / 20 = 0.85.
Point metrics: Precision
Precision = TP / (TP + FP). At Th = 0.5: 9 / 11 ≈ 0.818.
Point metrics: Positive Recall (Sensitivity)
Recall = TP / (TP + FN). At Th = 0.5: 9 / 10 = 0.9.
Point metrics: Negative Recall (Specificity)
Specificity = TN / (TN + FP). At Th = 0.5: 8 / 10 = 0.8.
Point metrics: F score
F1 = 2 · Precision · Recall / (Precision + Recall). At Th = 0.5: 2 · 0.818 · 0.9 / (0.818 + 0.9) ≈ 0.857. (The sketch below reproduces these numbers in code.)
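A sketch that reproduces the point metrics from the raw counts; the label/prediction vectors are hypothetical, chosen only to match the slide's confusion matrix (TP = 9, TN = 8, FP = 2, FN = 1):

    import numpy as np

    # Hypothetical vectors matching the confusion matrix at Th = 0.5:
    # 10 positives (9 predicted positive), 10 negatives (2 predicted positive).
    y_true = np.array([1] * 10 + [0] * 10)
    y_pred = np.array([1] * 9 + [0] + [1] * 2 + [0] * 8)

    tp = np.sum((y_pred == 1) & (y_true == 1))   # 9
    tn = np.sum((y_pred == 0) & (y_true == 0))   # 8
    fp = np.sum((y_pred == 1) & (y_true == 0))   # 2
    fn = np.sum((y_pred == 0) & (y_true == 1))   # 1

    accuracy    = (tp + tn) / (tp + tn + fp + fn)                  # 0.85
    precision   = tp / (tp + fp)                                   # ~0.818
    recall      = tp / (tp + fn)                                   # 0.9 (sensitivity)
    specificity = tn / (tn + fp)                                   # 0.8
    f1          = 2 * precision * recall / (precision + recall)    # ~0.857

scikit-learn's confusion_matrix and precision_recall_fscore_support return the same quantities.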
Point metrics: Changing threshold
Raising the threshold to Th = 0.6 gives TP = 7, TN = 8, FP = 2, FN = 3, so Accuracy = 0.75, Precision ≈ 0.778, Recall = 0.7, Specificity = 0.8, F1 ≈ 0.737.
Threshold Scanning
Scanning the threshold from 1.00 (Score = 1) down to 0.00 (Score = 0) in steps of 0.05:

Threshold  TP  TN  FP  FN  Accuracy  Precision  Recall  Specificity  F1
1.00        0  10   0  10    0.50      1.000     0.0       1.0      0.000
0.95        1  10   0   9    0.55      1.000     0.1       1.0      0.182
0.90        2  10   0   8    0.60      1.000     0.2       1.0      0.333
0.85        2   9   1   8    0.55      0.667     0.2       0.9      0.308
0.80        3   9   1   7    0.60      0.750     0.3       0.9      0.429
0.75        4   9   1   6    0.65      0.800     0.4       0.9      0.533
0.70        5   9   1   5    0.70      0.833     0.5       0.9      0.625
0.65        5   8   2   5    0.65      0.714     0.5       0.8      0.588
0.60        6   8   2   4    0.70      0.750     0.6       0.8      0.667
0.55        7   8   2   3    0.75      0.778     0.7       0.8      0.737
0.50        8   8   2   2    0.80      0.800     0.8       0.8      0.800
0.45        9   8   2   1    0.85      0.818     0.9       0.8      0.857
0.40        9   7   3   1    0.80      0.750     0.9       0.7      0.818
0.35        9   6   4   1    0.75      0.692     0.9       0.6      0.783
0.30        9   5   5   1    0.70      0.643     0.9       0.5      0.750
0.25        9   4   6   1    0.65      0.600     0.9       0.4      0.720
0.20        9   3   7   1    0.60      0.562     0.9       0.3      0.692
0.15        9   2   8   1    0.55      0.529     0.9       0.2      0.667
0.10        9   1   9   1    0.50      0.500     0.9       0.1      0.643
0.05       10   1   9   0    0.55      0.526     1.0       0.1      0.690
0.00       10   0  10   0    0.50      0.500     1.0       0.0      0.667
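A sketch of how rows like these can be generated, assuming `y_true` (0/1 labels) and `scores` (model scores in [0, 1]) as NumPy arrays; the guards against division by zero are an implementation detail, not from the slides:

    import numpy as np

    def metrics_at(y_true, scores, th):
        """Point metrics at a single threshold th."""
        y_pred = (scores >= th).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        precision   = tp / max(tp + fp, 1)
        recall      = tp / max(tp + fn, 1)
        specificity = tn / max(tn + fp, 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-12)
        accuracy = (tp + tn) / len(y_true)
        return tp, tn, fp, fn, accuracy, precision, recall, specificity, f1

    # Scan the threshold from 1.00 down to 0.00 in steps of 0.05.
    for th in np.arange(1.0, -0.001, -0.05):
        print(f"{th:.2f}", metrics_at(y_true, scores, th))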
How to summarize the trade-off?

{Precision, Specificity} vs Recall/Sensitivity


Summary metrics: ROC (rotated version)
[Figure: ROC curve traced out by sweeping the threshold from Score = 1 down to Score = 0.]
Summary metrics: PRC
[Figure: precision-recall curve traced out by sweeping the threshold from Score = 1 down to Score = 0.]
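These curves and their areas can be computed directly from the scores with scikit-learn, without ever picking a threshold. A sketch, again assuming hypothetical `y_true` and `scores` arrays:

    from sklearn.metrics import (roc_curve, roc_auc_score,
                                 precision_recall_curve, average_precision_score)

    # Full curves (the threshold sweep, for plotting the trade-off).
    fpr, tpr, roc_ths = roc_curve(y_true, scores)           # ROC: sensitivity vs. 1 - specificity
    prec, rec, pr_ths = precision_recall_curve(y_true, scores)

    # Single-number summaries.
    au_roc = roc_auc_score(y_true, scores)
    au_prc = average_precision_score(y_true, scores)        # a step-wise estimate of AU-PRC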
Summary metrics: Log-Loss motivation
[Figure: Model A and Model B scoring the same data set, each ranking the examples from Score = 1 down to Score = 0.]

Two models scoring the same data set. Is one of them better than the other?
Summary metrics: Log-Loss
● These two model outputs have the same ranking, and therefore the same AU-ROC, AU-PRC, and accuracy!
● Gain = the probability the model assigns to the correct label (p̂ if y = 1, 1 − p̂ if y = 0); log-loss = −log(gain).
● Log loss rewards confident correct answers and heavily penalizes confident wrong answers.
● exp(−E[log-loss]) is the geometric mean of the gains, in [0, 1].
● One perfectly confident wrong prediction is fatal.
● Gaining popularity as an evaluation metric (Kaggle).
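A sketch of log-loss and the gain view, with hypothetical labels and predicted probabilities (the clipping constant is an implementation detail, not from the slides):

    import numpy as np

    # Hypothetical labels and predicted probabilities.
    y_true = np.array([1, 1, 0, 0, 1])
    p      = np.array([0.9, 0.6, 0.2, 0.4, 0.8])

    def log_loss(y_true, p, eps=1e-15):
        """Mean negative log-likelihood of the true labels under the predicted probabilities."""
        p = np.clip(p, eps, 1 - eps)              # avoid log(0): one perfectly confident wrong answer is fatal
        gain = np.where(y_true == 1, p, 1 - p)    # probability assigned to the correct label
        return -np.mean(np.log(gain))

    gm_gain = np.exp(-log_loss(y_true, p))        # geometric mean of the gains, in (0, 1]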
Calibration
Logistic (th=0.5):
Precision: 0.872
Recall: 0.851
F1: 0.862
Brier: 0.099

SVC (th=0.5):
Precision: 0.872
Recall: 0.852
F1: 0.862
Brier: 0.163

Brier = MSE(p, y)
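The Brier score quoted above is just the mean squared error between the predicted probability and the 0/1 label, which is what separates these two otherwise similar models. A sketch, reusing the hypothetical `p` and `y_true` from the log-loss example (scikit-learn's brier_score_loss computes the same quantity):

    import numpy as np
    from sklearn.metrics import brier_score_loss

    brier = np.mean((p - y_true) ** 2)                 # Brier = MSE(p, y); lower means better calibrated
    assert np.isclose(brier, brier_score_loss(y_true, p))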
Unsupervised Learning
● logP(x) is a measure of fit in Probabilistic models (GMM, Factor Analysis)

○ High logP(x) on the training set but low logP(x) on the test set is a sign of overfitting (see the sketch below)

○ Raw value of logP(x) hard to interpret in isolation

● K-means is trickier (because of fixed covariance assumption)
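A sketch of using logP(x) this way with scikit-learn's GaussianMixture (the data here is hypothetical):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 2))
    X_test  = rng.normal(size=(100, 2))

    gmm = GaussianMixture(n_components=3, random_state=0).fit(X_train)
    print("train logP(x):", gmm.score(X_train))   # mean log-likelihood per sample
    print("test  logP(x):", gmm.score(X_test))    # much lower than train suggests overfitting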


Class Imbalance: Problems
Symptom: Prevalence < 5% (no strict definition)

Metrics: may not be meaningful.

Learning: may not focus on minority class examples at all (majority class can overwhelm
logistic regression, to a lesser extent SVM)
Class Imbalance: Metrics (pathological cases)
Accuracy: Blindly predict majority class.

Log-Loss: Majority class can dominate the loss.

AUROC: Easy to keep AUC high by scoring most negatives very low.

AUPRC: Somewhat more robust than AUROC. But other challenges.

- What kind of interpolation? AUCNPR?

In general, under class imbalance: Accuracy << AUROC << AUPRC in how informative they are. The sketch below illustrates the accuracy pathology.
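A tiny sketch of that pathology, with a hypothetical 2% prevalence:

    import numpy as np

    y_true = np.array([1] * 2 + [0] * 98)   # 2% prevalence
    y_pred = np.zeros_like(y_true)          # blindly predict the majority (negative) class

    accuracy = np.mean(y_pred == y_true)    # 0.98, despite catching no positives
    recall   = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)   # 0.0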


Multi-class (few remarks)
● Confusion matrix will be NxN (still want heavy diagonals, light off-diagonals)
● Most metrics (except accuracy) are generally analysed as multiple one-vs-rest (1-vs-many) problems.
● Multiclass variants of AUROC and AUPRC (micro vs. macro averaging; see the sketch below)
● Class imbalance is common (both in absolute, and relative sense)
● Cost sensitive learning techniques (also helps in Binary Imbalance)
○ Assign $$ value for each block in the confusion matrix, and incorporate those into the loss
function.
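A sketch of the NxN confusion matrix and the micro vs. macro averaging choice, using scikit-learn on hypothetical 3-class labels:

    from sklearn.metrics import confusion_matrix, f1_score

    y_true = [0, 0, 1, 1, 2, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0, 2]

    cm = confusion_matrix(y_true, y_pred)                  # 3x3: heavy diagonal is good
    f1_macro = f1_score(y_true, y_pred, average="macro")   # unweighted mean of per-class F1 (rare classes count equally)
    f1_micro = f1_score(y_true, y_pred, average="micro")   # pools all decisions (dominated by frequent classes)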
Choosing Metrics
Some common patterns:

- High precision is a hard constraint; maximize recall (e.g. search-engine results, grammar correction) -- intolerant to FP. Metric: Recall at Precision = XX%.
- High recall is a hard constraint; maximize precision (e.g. medical diagnosis) -- intolerant to FN. Metric: Precision at Recall = 100%.
- Capacity constrained (can only act on K items). Metric: Precision in top-K.
- Etc.

- Choose the operating threshold based on the above criteria (a sketch follows below).
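A sketch of picking an operating threshold for the first pattern, "best recall subject to precision >= XX%", assuming hypothetical `y_true` and `scores` arrays and a 90% precision target:

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    prec, rec, thresholds = precision_recall_curve(y_true, scores)
    ok = prec[:-1] >= 0.90                          # thresholds align with prec[:-1] / rec[:-1]
    if ok.any():
        best = np.argmax(rec[:-1] * ok)             # highest recall among points meeting the precision constraint
        print("Recall at Precision>=90%:", rec[:-1][best], "at threshold", thresholds[best])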


Thank You!
