DS605: Fundamentals of Machine Learning
Lecture 10
Evaluation - II
[Evaluation Metrics]
Arpit Rana
12th August 2024
Experimental Evaluation of Learning Algorithms
Given a representation, data, and a bias, the learning algorithm returns a Final Hypothesis (h).

[Diagram: Hypothesis Space 𝓗 → Learner (𝚪: S → h) → Final Hypothesis or Model (h)]

How to Check the Performance of Learning Algorithms?
Evaluation Metrics
Common Measures
Experimental Evaluation of Learning Algorithms
Typical Experimental Evaluation Metrics
● Error
● Accuracy
● Precision / Recall
Measures for Regression Problems
● Mean Absolute Error (MAE)
● Squared Error (MSE)

Which one is better and why? Consider:
● Non-differentiability (MAE is not differentiable where the residual is zero)
● Robustness (sensitivity to outliers)
● Unit changes in MSE
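For reference, the standard definitions in LaTeX (notation assumed here: $y_i$ is the true value, $\hat{y}_i$ the prediction, $n$ the number of examples):

    \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
    \qquad
    \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2

Because MSE squares each residual, a single large outlier can dominate it, while MAE grows only linearly with the size of the error; squaring also changes the units of the error (e.g. metres become square metres), whereas MAE keeps the units of the target.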
Measures for Classification Problems
● Misclassification Rate (a.k.a. Error Rate): the fraction of examples whose predicted class differs from their true class.
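Written out (a standard formulation, assuming $n$ labelled examples with true labels $y_i$, predicted labels $\hat{y}_i$, and the confusion-matrix counts defined on the next slide):

    \text{Error Rate} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\!\left[ \hat{y}_i \neq y_i \right] = \frac{FP + FN}{P + N}

Accuracy is its complement: $1 - \text{Error Rate} = \frac{TP + TN}{P + N}$.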
Measures for Classification Problems

Confusion Matrix

                                 True Class (Actual)
                                 Positive               Negative               Total
Hypothesized Class   Positive    True Positive (TP)     False Positive (FP)    P'
(Predicted)          Negative    False Negative (FN)    True Negative (TN)     N'
                     Total       P                      N                      P + N
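The common classification metrics can be read directly off this table; the standard definitions (not reproduced as text on the slide) are:

    \text{Accuracy} = \frac{TP + TN}{P + N}
    \qquad
    \text{Precision} = \frac{TP}{TP + FP} = \frac{TP}{P'}
    \qquad
    \text{Recall} = \frac{TP}{TP + FN} = \frac{TP}{P}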
Measures for Classification Problems

● F measure: the weighted harmonic mean of precision and recall, with weights ⍺ ∈ [0, 1] and 𝛽 ∈ [0, ∞).
● For ⍺ = ½ (equivalently, 𝛽 = 1), the F measure weights precision and recall equally and is known as the F1 measure.
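Written out (the standard formulation of the weighted harmonic mean, with $\beta^2 = (1-\alpha)/\alpha$):

    F = \frac{1}{\alpha \cdot \frac{1}{\text{Precision}} + (1-\alpha) \cdot \frac{1}{\text{Recall}}}
      = \frac{(1+\beta^2)\,\text{Precision}\cdot\text{Recall}}{\beta^2\,\text{Precision} + \text{Recall}}

    F_1 = \frac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}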
Measures for Classification Problems
What metric would you use to measure the performance of the following classifiers?
● A classifier to detect videos that are safe for kids.
● A classifier to detect shoplifters in surveillance images.
Precision/Recall Trade-off
● Images are ranked by their classifier score (here, the classifier decides whether an image is a 5 or not).
● Those above the chosen decision threshold are considered positive.
● The higher the threshold, the lower the recall, but (in general) the higher the precision.
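A minimal sketch of this trade-off with scikit-learn (the library, the toy data, and the 90%-precision target are assumptions for illustration, not from the lecture): sweep the decision threshold over the classifier's scores and observe that high precision comes at the cost of recall.

    # Sketch: precision and recall as a function of the decision threshold.
    import numpy as np
    from sklearn.metrics import precision_recall_curve

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)                    # toy ground-truth labels (0/1)
    y_score = y_true * 1.0 + rng.normal(0, 1.2, size=1000)    # noisy scores correlated with the labels

    precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)

    # Pick the first threshold that reaches at least 90% precision,
    # then check how much recall is left at that operating point.
    idx = np.argmax(precisions[:-1] >= 0.90)   # precisions has one extra trailing element
    print(f"threshold={thresholds[idx]:.2f}  "
          f"precision={precisions[idx]:.2f}  recall={recalls[idx]:.2f}")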
Precision/Recall Trade-off
How do you decide which threshold to use?
● A high-precision classifier is not very useful if its recall is too low!
● If someone says “let’s reach 99% precision,” you should ask, “at what recall?”
To take recall into consideration, we use other measures.
The ROC Curve
● The receiver operating characteristic (ROC) curve is another common tool used with
binary classifiers.
● It is very similar to the precision/recall curve,
○ but instead of plotting precision versus recall,
○ the ROC curve plots the true positive rate (TPR, another name for recall or sensitivity) against the false positive rate (FPR, which equals 1 − specificity).
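In terms of the confusion-matrix cells defined earlier:

    \text{TPR} = \text{Recall} = \frac{TP}{TP + FN}
    \qquad
    \text{FPR} = \frac{FP}{FP + TN} = 1 - \text{Specificity},
    \quad \text{where } \text{Specificity} = \frac{TN}{TN + FP}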
The ROC Curve
● Once again there is a trade-off: the higher the recall (TPR), the more false positives (FPR) the classifier produces.
● The dotted line (the diagonal) represents the ROC curve of a purely random classifier.
● A good classifier stays as far away from that line as possible (toward the top-left corner).
AUC: Area Under the (ROC) Curve
● One way to compare classifiers is to measure the area under the curve (AUC).
● A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5.
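A minimal scikit-learn sketch (the toy data and names are assumptions for illustration): compute the ROC curve points and the ROC AUC for a score-based binary classifier.

    # Sketch: ROC curve and ROC AUC for a score-based binary classifier.
    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)                    # toy ground-truth labels (0/1)
    y_score = y_true * 1.0 + rng.normal(0, 1.2, size=1000)    # noisy classifier scores

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print(f"ROC AUC = {roc_auc_score(y_true, y_score):.3f}")  # 1.0 = perfect, 0.5 = random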
Note: As a rule of thumb, you should prefer the PR (precision-recall) curve whenever the
positive class is rare or when you care more about the false positives than the false negatives,
and the ROC curve otherwise.
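To see why, here is a sketch (assumed toy data, not from the lecture) in which the positive class is rare: the ROC AUC still looks fairly strong, while the precision-recall summary (average precision) reveals a much weaker classifier.

    # Sketch: with a rare positive class, ROC AUC can look optimistic
    # while the precision-recall summary (average precision) stays low.
    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    rng = np.random.default_rng(0)
    y_true = (rng.random(100_000) < 0.01).astype(int)           # ~1% positives
    y_score = y_true * 2.0 + rng.normal(0, 1.2, size=100_000)   # same noisy-score model as before

    print(f"ROC AUC           = {roc_auc_score(y_true, y_score):.3f}")
    print(f"Average precision = {average_precision_score(y_true, y_score):.3f}")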
Next lecture: Loss Functions
13th August 2024