MODULE 10: PERFORMANCE MEASURES
The Confusion Matrix
A confusion matrix is a table showing counts of samples that have been correctly
classified and those that have been incorrectly classified by a model. As an
example, suppose we have photos of 50 people, of whom 20 are male and 30 are
female. We want our model to classify each photo as either male or female, with
female being the positive class. In other words, the question the model needs to
answer is: "is this a photo of a female person?" If it answers yes (1), it has
predicted female. If it answers no (0), it has predicted male. Table 10.1 shows
the counts of predictions made by the model.
Table 10.1 Confusion matrix of gender classification

                    Predicted Female    Predicted Male
    Actual Female       25 (TP)             5 (FN)
    Actual Male          4 (FP)            16 (TN)
The rows show the actual gender of the people and the columns show the
classification made by the model. The TP (true positive) cell shows the number of
females correctly classified as female. The FN (false negative) cell shows the
number of females incorrectly classified as male. The FP (false positive) cell
shows the number of males incorrectly classified as female. The TN (true negative)
cell shows the number of males correctly classified as male. Most of the other
measures of performance are calculated from the values in the confusion matrix.
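As an illustration, the sketch below (assuming scikit-learn and NumPy are installed) builds label vectors that reproduce the counts in Table 10.1 and computes the same confusion matrix with scikit-learn's confusion_matrix function; coding female as 1 and male as 0 is our own choice for this example.

import numpy as np
from sklearn.metrics import confusion_matrix

#30 actual females (coded 1, the positive class) followed by 20 actual males (coded 0)
y_true = np.array([1] * 30 + [0] * 20)
#the model gets 25 females right, misses 5, and wrongly labels 4 males as female
y_pred = np.array([1] * 25 + [0] * 5 + [1] * 4 + [0] * 16)

#with labels=[1, 0] the first row and column correspond to the positive (female) class
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)  #[[25  5]
           # [ 4 16]]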
Accuracy
Accuracy is a very common measure for reporting performance. It is intuitive and
easy for most people to understand. It is calculated using the formula below.

accuracy = (TP + TN) / (TP + TN + FP + FN)
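For the example in Table 10.1, accuracy = (25 + 16) / 50 = 0.82.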
Error Rate
This is the opposite of accuracy: it represents the proportion of
misclassifications. It is calculated using the formula below.

error rate = (FP + FN) / (TP + TN + FP + FN)
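For the example in Table 10.1, error rate = (4 + 5) / 50 = 0.18.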
Sensitivity/Recall/True Positive Rate
This measures the ability of the model to classify the members of the positive
class correctly. It answers the question: what percent of the positive class were
correctly classified? It is calculated using the formula below.
sensitivity = TP / (TP + FN)
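For the example in Table 10.1, sensitivity = 25 / (25 + 5) ≈ 0.83, i.e. about 83% of the females were correctly classified.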
Specificity/True Negative Rate
This measures the ability of the model to classify the members of the negative
class correctly. It answers the question: what percent of the negative class were
correctly classified? It is calculated using the formula below.
specificity = TN / (TN + FP)
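For the example in Table 10.1, specificity = 16 / (16 + 4) = 0.80.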
Precision
This measures the ability of the model to classify only members of the positive
class as positive. It answers the question: what percent of the samples classified
as positive are actually positive? It is calculated using the formula below.

precision = TP / (TP + FP)
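For the example in Table 10.1, precision = 25 / (25 + 4) ≈ 0.86.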
F1 Score
This is an average of the precision and the recall using the harmonic mean of the
two. The harmonic mean is a type of average that gives greater weight to the
lesser of the two values. It is calculated using the formula below. It is a more
reliable measure of performance for unbalanced datasets. An unbalanced dataset is
one in which the number of samples of one class is much greater than the number
of samples of the other class.

F1 score = (2 × precision × recall) / (precision + recall)
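For the example in Table 10.1, precision ≈ 0.862 and recall ≈ 0.833, so F1 score = (2 × 0.862 × 0.833) / (0.862 + 0.833) ≈ 0.85.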
Balanced Accuracy
To calculate the balanced accuracy, we calculate the accuracy for each class
separately and then average the two results. It is also a more reliable measure
for unbalanced datasets. It is calculated using the formula below, which assumes
we have two classes only.
balanced accuracy = (TP / (TP + FN) + TN / (TN + FP)) / 2
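For the example in Table 10.1, balanced accuracy = (25/30 + 16/20) / 2 ≈ 0.82.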
Area under the Curve (AUC)
This measure tries to balance the benefits (true positives) against the costs
(false positives). For a given model, it shows how the benefits are affected by
the costs. A receiver operating characteristic (ROC) curve is used for comparing
classification models. Models with a greater area under their ROC curves are more
beneficial, i.e. they achieve a higher true positive rate for a given false
positive rate. In an ROC curve, the TPR (sensitivity) is plotted against the FPR
(1 - specificity). You will learn more about the AUC and how it is calculated in
the reading material provided later in the lesson.
Figure 10.1 ROC curve
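As a minimal sketch (assuming scikit-learn is installed; the labels and scores below are made up purely for illustration), the points of the ROC curve and the AUC can be computed with roc_curve and roc_auc_score:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

#hypothetical true labels and predicted scores for the positive class
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.45, 0.4, 0.3, 0.2])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  #points along the ROC curve
auc = roc_auc_score(y_true, y_score)               #area under that curve
print(auc)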
Unbalanced Data Sets
Consider a dataset consisting of fraudulent and legitimate online transactions.
Under normal circumstances, only a very small proportion of the transactions are
fraudulent. As an example, in the Credit Card Fraud Detection dataset, only 0.18%
of the transactions are fraudulent, which is far less than one percent. Such a
dataset is called imbalanced. Since most of the training examples are legitimate,
a classification model will learn to classify the legitimate transactions
correctly while performing poorly on the fraudulent transactions. Some measures of
performance are not appropriate for imbalanced datasets. Accuracy, for instance,
may not give a true picture when applied to an unbalanced dataset since the
accuracy of the majority class will dominate. Balanced accuracy is, however, a
good measure for imbalanced datasets. Other measures such as sensitivity and
specificity can also be applied to imbalanced datasets.
There are other ways of dealing with imbalance in the dataset. Oversampling is a
technique where a new dataset is created by replicating samples belonging to the
minority class so that their number is approximately equal to that of the majority
class. The training and test sets are then obtained from this new dataset.
Undersampling is another technique where samples are randomly selected from the
majority class such that the number of majority class samples selected is
approximately equal to the number of samples in the minority class.
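As a minimal sketch of oversampling (assuming scikit-learn and NumPy are installed; the synthetic dataset created with make_classification and all variable names are placeholders chosen only for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

#a small synthetic imbalanced dataset: roughly 90% class 0 and 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

#separate the samples of each class
X_minority, y_minority = X[y == 1], y[y == 1]
X_majority, y_majority = X[y == 0], y[y == 0]

#oversampling: replicate minority samples (with replacement) until the classes are balanced
X_min_up, y_min_up = resample(X_minority, y_minority, replace=True,
                              n_samples=len(y_majority), random_state=0)

#stack the classes back together; training and test sets are then drawn from this new dataset
X_balanced = np.vstack([X_majority, X_min_up])
y_balanced = np.concatenate([y_majority, y_min_up])
print(np.bincount(y_balanced))

Undersampling would be the mirror image of this sketch: resample the majority class with replace=False and n_samples=len(y_minority).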
Cross-Validation
Cross-validation is a method that is used to come up with more reliable
performance estimates. If we use accuracy as an example, in cross-validation you
calculate the accuracy several times (say n times), then take the average of the
n accuracy values. In k-fold cross-validation, the dataset is divided into k
distinct subsets (d1, d2, ..., dk). The training and testing cycle is repeated k
times. In the ith repetition, the subset di is used as the test set while all the
other subsets are combined and used for training. k can be any number; however,
k = 10 is commonly used. Leave-one-out is a special case of k-fold
cross-validation where k is equal to the number of samples in the dataset.
Therefore, in each train-test cycle, only one sample is used for testing and the
rest of the samples are used for training.
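As a minimal sketch of 10-fold cross-validation (assuming scikit-learn is installed; the iris dataset and a k-nearest neighbours classifier are used only as placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

#one accuracy value per fold; the overall estimate is the average of the 10 values
scores = cross_val_score(knn, x, y, cv=10, scoring='accuracy')
print(scores.mean())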
Cross-validation is commonly used in the selection of hyperparameters for a
training algorithm. As an example, it can be used in the selection of k in the
k-nearest neighbours algorithm. It is also used in deciding which model to
implement (model selection): the performances of the models are compared using
cross-validation and the best performing model is selected.
IMPLEMENTATION OF CROSS-VALIDATION USING PYTHON
You are going to implement cross-validation using the GridSearchCV class. This
class determines the average score for each hyperparameter setting using
cross-validation for a given machine learning model. You will see how the best
value for the k hyperparameter in the k-nearest neighbours algorithm can be
determined.
#import the packages and classes we will use
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import pandas as pd
#load the data
data=load_iris()
#get the target values
y=data.target
#get the labels
labels=data.target_names
#get the x values
x=data.data
#use the standard scaler to standardize the features
scaler=StandardScaler()
x=scaler.fit_transform(x)
#some metrics you can use to test performance
metrics.get_scorer_names()
#set the hyperparameter values to be examined: k=1 to 9
grid={'n_neighbors':np.arange(1,10)}
#create the classifier
knn=KNeighborsClassifier()
#since this is a multiclass dataset, GridSearchCV will use the StratifiedKFold splitting strategy
#The data will be split into 10 subsets.
#For each value of k, 9 of the subsets are used for training while one subset is reserved for testing
#This is repeated for each subset
GSKnnCv=GridSearchCV(knn,grid,cv=10, refit=False,
scoring=['accuracy','balanced_accuracy','precision_macro','f1_macro'])
#run the cross-validated grid search on the scaled data
GSKnnCv.fit(x,y)
#The results are returned in a dictionary
GSKnnCv.cv_results_
#The dictionary can be imported into a dataframe
df=pd.DataFrame(GSKnnCv.cv_results_)
df.head(10)
#select the average score for each number of neighbours together with the rank of
#that score across the different numbers of neighbours
df1=df.loc[:,
['param_n_neighbors','mean_test_accuracy','rank_test_accuracy',
'mean_test_balanced_accuracy','rank_test_balanced_accuracy',
'mean_test_precision_macro','rank_test_precision_macro',
'mean_test_f1_macro','rank_test_f1_macro']]
df1.head(10)
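One possible way (not part of the original listing) to read off the best value of k according to a chosen metric is to pick the row whose rank for that metric is 1:

#select the value of k whose mean test accuracy is ranked first
best=df1.loc[df1['rank_test_accuracy']==1]
print(best[['param_n_neighbors','mean_test_accuracy']])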