DS605: Fundamentals of Machine Learning

Lecture 09

Evaluation - I
[Schemes for Data Split and Handling Bias-Variance]

Arpit Rana
8th August 2024
Learning = Representation + Evaluation + Optimization

Representation ✔
Choosing a representation of the learner: the hypothesis
space or the model class — the set of models that it can
possibly learn.

Evaluation
Choosing an evaluation function (also called objective
function, utility function, loss function, or scoring
function) is needed to distinguish good classifiers from
bad ones.

Optimization
Choosing a method to search among the models in the
hypothesis space for the highest-scoring one.
Experimental Evaluation of Learning Algorithms

The overall objective of the learning algorithm is to find a hypothesis that

● is consistent (i.e., fits the training data), but more importantly,

● generalizes well for previously unseen data.

Given a representation, data, and a bias, the learning algorithm returns a final hypothesis.

Experimental evaluation defines ways to measure the generalizability of a final hypothesis or
learning algorithm.

[Figure: Hypothesis Space 𝓗 → Learner (𝚪: S → h) → Model (h)]
Experimental Evaluation of Learning Algorithms

Sample Error
The sample error of hypothesis h with respect to the target function f and data sample S is:

    error_S(h) = (1/n) Σ_{x ∈ S} δ(f(x) ≠ h(x))

where n = |S| and δ(·) is 1 if its argument is true and 0 otherwise.

It is impossible to assess the true error directly, so we try to estimate it using the sample
error.

True Error

The true error of hypothesis h with respect to the target function f and the distribution D is
the probability that h will misclassify an instance drawn at random according to D:

    error_D(h) = Pr_{x ∼ D} [ f(x) ≠ h(x) ]
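
To make the distinction concrete, here is a minimal sketch (assuming NumPy, with hypothetical
arrays holding f(x) and h(x) for each instance x in a sample S) that computes the sample error
as the fraction of misclassified examples:

import numpy as np

# Hypothetical sample S: y_true holds f(x) and y_pred holds h(x)
# for each instance x in S.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Sample error: fraction of instances in S that h misclassifies.
sample_error = np.mean(y_true != y_pred)
print(f"sample error = {sample_error:.3f}")  # 2 mistakes out of 8 -> 0.250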
Generalizing to Unseen Data

The error on the training set is called the training error (a.k.a. resubstitution error and
in-sample error).

● The training error is not, in general, a good indicator of performance on unseen data. It's
often too optimistic.

● Why?
Generalizing to Unseen Data

To predict future performance, we need to measure error on an independent dataset:

● We want a dataset that has played no part in creating the model.

● This second dataset is called the test set.

● The error on the test set is called the test error (a.k.a. out-of-sample error and
extra-sample error).

Given a data sample S, there are methodologies to better approximate the true error of the
model.
Holdout Method

● Shuffle the dataset and partition it into two disjoint sets:

○ training set (e.g., 80% of the full dataset); and

○ test set (the rest of the full dataset).

● Train the estimator on the training set.

● Test the model (evaluate the predictions) on the test set.

[Figure: Dataset → Shuffled Dataset → Train | Test]

It is essential that the test set is not used in any way to create the model. Don't even look at it!
● 'Cheating' is called leakage.
● 'Cheating' is one cause of overfitting.
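
A minimal sketch of the holdout method, assuming scikit-learn; the synthetic dataset, the 80/20
split, and the logistic-regression estimator are illustrative choices rather than anything
prescribed by the slides:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the full dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Shuffle and partition into disjoint training (80%) and test (20%) sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

# Train the estimator on the training set only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate the predictions on the held-out test set,
# which played no part in creating the model.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy = {test_accuracy:.3f}")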
Holdout Method: Class Exercise

Standardization, as we know, is about scaling the data. It requires calculation of the mean and
standard deviation.

When should the mean and standard deviation be calculated? And why?
(a) before splitting, on the entire dataset, or
(b) after splitting, on just the training set, or
(c) after splitting, on just the test set, or
(d) after splitting, on the training and test sets separately.

What to do when the model is deployed?
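
For reference, a minimal sketch (assuming scikit-learn's StandardScaler and the X_train / X_test
arrays from the holdout sketch above) of the leakage-free workflow, where the statistics are
computed on the training set and then reused everywhere else, including on new data after
deployment:

from sklearn.preprocessing import StandardScaler

# Fit the scaler (mean and standard deviation) on the training set only.
scaler = StandardScaler().fit(X_train)

# Apply the same training-set statistics to both splits; the test set is
# transformed but never used to compute the statistics.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# At deployment time, keep the fitted scaler (e.g., persist it alongside the
# model) and reuse its statistics to transform incoming instances.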


Facts about Holdout Method

● The disadvantages of this method are:


○ Results can vary quite a lot across different runs.
○ Informally, you might get lucky — or unlucky,
i.e., in any one split, the data used for training or testing might not be
representative.

● We are training on only a subset of the available dataset, perhaps as little as 50% of it.
From so little data, we may learn a worse model and so our error measurement may be
pessimistic.

● In practice, we only use the holdout method when we have a very large dataset. The size
of the dataset mitigates the above problems.

● When we have a smaller dataset, we use a resampling method:


○ The examples get re-used for training and testing.
K-fold Cross-Validation Method

The most-used resampling method is k-fold cross-validation:

● Shuffle the dataset and partition it into k disjoint subsets of equal size.
○ Each of the partitions is called a fold.
○ Typically, k=10, so you have 10 folds.

● You take each fold in turn and use it as the test set, training the learner on the remaining
folds.

● Clearly, you can do this k times, so that each fold gets 'a turn' at being the test set.
○ By this method, each example is used exactly once for testing, and k-1 times for
training.
K-fold Cross-Validation: Pseudocode

● Shuffle the dataset D and partition it into k
disjoint equal-sized subsets, D1, ..., Dk

● for i = 1 to k:

○ train on D \ Di
○ make predictions for Di
○ measure error (e.g. MAE)

● Report the mean of the errors

[Figure: k = 5 folds, each fold taking one turn as the test set]
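
A minimal sketch of this loop, assuming scikit-learn's KFold; the choice of 5 folds, a
ridge-regression estimator, and MAE as the error measure are illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Shuffle the dataset and partition it into k = 5 disjoint folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kf.split(X):
    # Train on D \ Di (the union of the other folds).
    model = Ridge().fit(X[train_idx], y[train_idx])
    # Make predictions for fold Di and measure the error (MAE here).
    preds = model.predict(X[test_idx])
    fold_errors.append(mean_absolute_error(y[test_idx], preds))

# Report the mean of the per-fold errors.
print(f"mean MAE over 5 folds = {np.mean(fold_errors):.3f}")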
Facts about K-fold Cross-Validation

● The disadvantages of this method are:

○ The number of folds is constrained by the size of the dataset and by the desire,
sometimes on the part of statisticians, to have folds of at least 30 examples.

○ It can be costly to train the learning algorithm k times.

○ There may still be some variability in the results due to 'lucky'/'unlucky' splits.

● The extreme is k = n, also known as leave-one-out cross-validation or LOOCV.
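
A minimal sketch of this extreme, assuming scikit-learn's LeaveOneOut and reusing the X, y, and
Ridge estimator from the k-fold sketch above:

from sklearn.model_selection import LeaveOneOut, cross_val_score

# Leave-one-out CV: each of the n examples is the test set exactly once,
# so the learner is trained n times. This can be costly for large n.
scores = cross_val_score(Ridge(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")
print(f"LOOCV mean MAE = {-scores.mean():.3f}")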


Nested K-fold Cross-Validation Method

In case of hyperparameter (parameters of the model class, not of the individual model) or
parameter tuning, we partition the whole dataset into three disjoint sets:

● A training set to train candidate models.

● A validation set (a.k.a. a development set or dev set)
to evaluate the candidate models and choose the
best one.

● A test set to do a final unbiased evaluation of the
best model.

K-fold Cross-Validation can be applied to the validation set
(inner CV) and the test set (outer CV) in a nested way.

[Figure: Dataset → Shuffled Dataset → Train | Dev. | Test;
train and select the best model; merge, train and test the model;
merge, train and deploy the model]
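
A minimal sketch of nested cross-validation with scikit-learn, where GridSearchCV plays the role
of the inner CV (choosing a hyperparameter on validation folds) and cross_val_score the outer CV
(the final unbiased estimate); the Ridge estimator, the alpha grid, and the fold counts are
illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Inner CV: tune the hyperparameter (the ridge penalty alpha) on validation
# folds carved out of each outer training split.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]},
                      cv=inner_cv, scoring="neg_mean_absolute_error")

# Outer CV: estimate the generalization error of the whole
# "tune, then retrain on the merged folds" procedure on held-out test folds.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(search, X, y, cv=outer_cv,
                         scoring="neg_mean_absolute_error")
print(f"nested-CV mean MAE = {-scores.mean():.3f}")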
Model’s Performance

Diagnose the model by checking the errors in order:

● Training error high → Underfitting
● Training error low, validation error high → Overfitting
● Training and validation errors low, test error high → I.I.D. violation
● Training, validation, and test errors all low → Good model
Model’s Performance

The same diagnosis (training error high → underfitting; validation error high → overfitting;
test error high → I.I.D. violation; all low → good model) suggests the following remedies.

Underfitting
● Need a more complex model
● Need less regularization
● Need more features
● More data doesn't work

Overfitting
● Need a simpler model
● Need more regularization
● Remove extra features
● Need more data
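
A minimal sketch of this decision chart as code, with hypothetical error values and an
illustrative threshold for what counts as a 'high' error:

def diagnose(train_err: float, val_err: float, test_err: float,
             high: float = 0.15) -> str:
    """Hypothetical diagnostic following the chart above; `high` is an
    illustrative threshold, not a universal constant."""
    if train_err > high:
        return "underfitting: try a more complex model, less regularization, more features"
    if val_err > high:
        return "overfitting: try a simpler model, more regularization, fewer features, or more data"
    if test_err > high:
        return "i.i.d. violation: the test data may not follow the training distribution"
    return "good model"


print(diagnose(train_err=0.05, val_err=0.07, test_err=0.06))  # -> good model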
Next lecture: Evaluation - II
11th August 2024
