Module 8
Learning Best Practices for Model
Evaluation and Hyperparameter Tuning
Using k-fold cross-validation to assess
model performance
• One of the key steps in building a machine learning model is to estimate its
performance on data that the model hasn't seen before.
• When we fit our model on a training dataset and use the same data to estimate
how well it performs, a model can either suffer from underfitting (high bias) if the
model is too simple, or it can overfit the training data (high variance) if the model
is too complex for the underlying training data.
• To find an acceptable bias-variance trade-off, we need to evaluate our model
carefully.
• The common cross-validation techniques, holdout cross-validation and k-fold cross-validation, can help us obtain reliable estimates of the model's generalization performance, that is, how well the model performs on unseen data.
The holdout method
• A classic and popular approach for estimating the generalization performance of machine
learning models is holdout cross-validation.
• In typical machine learning applications, we are also interested in tuning and comparing different parameter settings to further improve the performance for making predictions on unseen data.
• This process is called model selection; the term refers to selecting the optimal values of the tuning parameters (also called hyperparameters) for a given classification problem.
• However, if we reuse the same test dataset over and over again during model selection, it effectively becomes part of our training data, and the model is therefore more likely to overfit; this is not good machine learning practice.
• A better way of using the holdout method for model selection is to separate the data into
three parts: a training set, a validation set, and a test set.
• The training set is used to fit the different models, and the performance on the
validation set is then used for the model selection.
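As a minimal sketch of such a three-way split (assuming scikit-learn and, purely for illustration, the breast cancer dataset bundled with it), train_test_split can be called twice:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load an example dataset; any labeled dataset would work the same way
X, y = load_breast_cancer(return_X_y=True)

# First split off a held-out test set (20% of the data) ...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# ... then carve a validation set out of the remaining 80%,
# which yields roughly a 60/20/20 train/validation/test split
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=1)

print(len(X_train), len(X_valid), len(X_test))

The different models are then fit on the training set, compared on the validation set, and only the finally selected model is evaluated once on the test set.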
The Concept of Holdout Method
• The validation set is used to repeatedly evaluate the performance of the model after training it with different hyperparameter values. Once we are satisfied with the tuning of the hyperparameter values, we estimate the model's generalization performance on the test dataset.
• A disadvantage of the holdout method is that the performance estimate may be very sensitive to how we partition the training set into the training and validation subsets; the estimate will vary for different samples of the data. In the next subsection, we will take a look at a more robust technique for performance estimation, k-fold cross-validation, where we repeat the holdout method k times on k subsets of the training data.
(Figure: Concept of holdout cross-validation)
K-fold cross-validation
• In k-fold cross-validation, we randomly split the training dataset into k folds without replacement, where k - 1 folds are used for the model training, and one fold is used for performance evaluation.
• This procedure is repeated k times so that we obtain k models and performance
estimates.
• A good standard value for k in k-fold cross-validation is 10, as empirical evidence
shows.
• However, if we are working with relatively small training sets, it can be useful to
increase the number of folds.
• On the other hand, if we are working with large datasets, we can choose a
smaller value for k, for example, k = 5, and still obtain an accurate estimate of the
average performance of the model while reducing the computational cost of
refitting and evaluating the model on the different folds.
Since k-fold cross-validation is a resampling
technique without replacement, the advantage
of this approach is that each sample point will
be used for training and validation (as part of a
test fold) exactly once, which yields a lower-
variance estimate of the model performance
than the holdout method. The following figure
summarizes the concept behind k-fold cross-
validation with k = 10. The training dataset is
divided into 10 folds, and during the 10
iterations, nine folds are used for training, and
one fold is used as the test set for the model evaluation. The estimated performances E_i (for example, classification accuracy or error) from the k folds are then used to calculate the estimated average performance E of the model:
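E = (1/k) Σ_{i=1}^{k} E_i, that is, the arithmetic mean of the k per-fold performance estimates.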
SAMPLE CODES
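A sketch of how this could look in scikit-learn is shown below; the logistic regression pipeline and the breast cancer dataset are illustrative assumptions, not fixed by the slides. The manual StratifiedKFold loop makes the k iterations explicit, and cross_val_score performs the same evaluation in one call:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(max_iter=10000, random_state=1))

# Manual 10-fold stratified cross-validation: each iteration trains on
# nine folds and evaluates on the remaining fold
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = []
for k, (train_idx, test_idx) in enumerate(kfold.split(X_train, y_train)):
    pipe_lr.fit(X_train[train_idx], y_train[train_idx])
    score = pipe_lr.score(X_train[test_idx], y_train[test_idx])
    scores.append(score)
    print('Fold %2d, accuracy: %.3f' % (k + 1, score))

print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

# The same evaluation in a single call
scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, cv=10)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

Each per-fold accuracy plays the role of one E_i above, and the reported mean is the average performance E.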
Debugging algorithms with
learning and validation curves
• Learning curves and validation curves are simple yet powerful diagnostic tools that can help us improve the performance of a learning algorithm.
• They can be used to diagnose whether a learning algorithm has a problem with overfitting (high variance) or underfitting (high bias).
• High bias: both the training and cross-validation accuracy are low, which indicates that the model underfits the training data. Possible solutions: increase the number of parameters of the model, for example, by collecting or constructing additional features, or by decreasing the degree of regularization.
• High variance: indicated by a large gap between the training and cross-validation accuracy. Possible solutions: collect more training data, reduce the complexity of the model, or increase the regularization parameter.
Diagnosing bias and variance problems with learning curves
SAMPLE CODES
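A sketch of how such a learning curve could be produced with scikit-learn's learning_curve function follows; the standardized logistic regression pipeline and the breast cancer dataset are illustrative assumptions:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(max_iter=10000, random_state=1))

# 10 evenly spaced training-set sizes, each evaluated with 10-fold cross-validation
train_sizes, train_scores, test_scores = learning_curve(
    estimator=pipe_lr, X=X_train, y=y_train,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=10)

train_mean, train_std = np.mean(train_scores, axis=1), np.std(train_scores, axis=1)
test_mean, test_std = np.mean(test_scores, axis=1), np.std(test_scores, axis=1)

# Plot the mean accuracies with +/- one standard deviation bands
plt.plot(train_sizes, train_mean, marker='o', label='Training accuracy')
plt.fill_between(train_sizes, train_mean + train_std, train_mean - train_std, alpha=0.15)
plt.plot(train_sizes, test_mean, marker='s', linestyle='--', label='Validation accuracy')
plt.fill_between(train_sizes, test_mean + test_std, test_mean - test_std, alpha=0.15)
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid()
plt.show()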
(Figure: learning curve plot)
Via the train_sizes parameter in the learning_curve function, we can control the
absolute or relative number of training samples that are used to generate the
learning curves. Here, we set train_sizes=np.linspace(0.1, 1.0, 10) to use 10
evenly spaced, relative intervals for the training set sizes. By default, the
learning_curve function uses stratified k-fold cross-validation to calculate the
cross-validation accuracy of a classifier, and we set k=10 via the cv parameter
for 10-fold stratified cross-validation. Then, we simply calculated the
average accuracies from the returned cross-validated training and test scores
for the different sizes of the training set, which we plotted using Matplotlib's
plot function. Furthermore, we added the standard deviation of the average
accuracy to the plot using the fill_between function to indicate the variance of
the estimate. As we can see in the preceding learning curve plot, our model performs quite well on both the training and validation datasets if it has seen more than 250 samples during training. We can also see that the training accuracy increases for training sets with fewer than 250 samples, and the gap between validation and training accuracy widens, which is an indicator of an increasing degree of overfitting.
Addressing over- and underfitting with
validation curves
• Validation curves are a useful tool for improving the performance of a
model by addressing issues such as overfitting or underfitting.
Validation curves are related to learning curves, but instead of
plotting the training and test accuracies as functions of the sample
size, we vary the values of the model parameters, for example, the
inverse regularization parameter C in logistic regression. Let's go
ahead and see how we create validation curves via scikit-learn:
Sample codes
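A sketch of how such a validation curve could be generated is shown below; it reuses the same illustrative pipeline and data setup as the learning-curve sketch and varies the logisticregression__C parameter over an assumed value range:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Same illustrative setup as in the learning-curve sketch
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)
pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(max_iter=10000, random_state=1))

# Vary the inverse regularization parameter C of the classifier inside the pipeline
param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(
    estimator=pipe_lr, X=X_train, y=y_train,
    param_name='logisticregression__C',
    param_range=param_range, cv=10)

train_mean, train_std = np.mean(train_scores, axis=1), np.std(train_scores, axis=1)
test_mean, test_std = np.mean(test_scores, axis=1), np.std(test_scores, axis=1)

plt.plot(param_range, train_mean, marker='o', label='Training accuracy')
plt.fill_between(param_range, train_mean + train_std, train_mean - train_std, alpha=0.15)
plt.plot(param_range, test_mean, marker='s', linestyle='--', label='Validation accuracy')
plt.fill_between(param_range, test_mean + test_std, test_mean - test_std, alpha=0.15)
plt.xscale('log')  # the C values span several orders of magnitude
plt.xlabel('Parameter C')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.grid()
plt.show()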
Similar to the learning_curve function, the validation_curve function uses
stratified k-fold cross-validation by default to estimate the performance of the
classifier. Inside the validation_curve function, we specified the parameter that
we wanted to evaluate. In this case, it is C, the inverse regularization parameter of
the LogisticRegression classifier, which we wrote as 'logisticregression__C' to
access the LogisticRegression object inside the scikit-learn pipeline for a specified
value range that we set via the param_range parameter. Similar to the learning
curve example in the previous section, we plotted the average training and cross-
validation accuracies and the corresponding standard deviations. Although the
differences in the accuracy for varying values of C are subtle, we can see that the
model slightly underfits the data when we increase the regularization strength
(small values of C). However, large values of C mean lowering the strength of regularization, so the model tends to slightly overfit the data. In this case, the
sweet spot appears to be between 0.01 and 0.1 of the C value.