
LOGISTIC REGRESSION IN PYTHON USING DASK

Duvérier DJIFACK ZEBAZE


Data Scientist - Quantitative Risk Analyst
Contents

1 Introduction
  1.1 Overview of dask
  1.2 How to install Dask?
  1.3 Spam data
    1.3.1 Data splitting

2 Logistic regression
  2.1 Classification
  2.2 About logistic regression
  2.3 Logistic regression in dask
  2.4 Prediction on the test sample
  2.5 Confusion matrix and assessment indicators
    2.5.1 Accuracy Score
    2.5.2 Log loss
    2.5.3 Precision Score
    2.5.4 Recall Score
    2.5.5 F1-Score
    2.5.6 Classification report
    2.5.7 Roc Curve
    2.5.8 Lift Curve

3 Bibliography
1 Introduction
1.1 Overview of dask
The main objective of this paper is to give you a brief overview of Dask, a parallel computing
library for Python. Dask parallelizes many libraries in the Python ecosystem, such as NumPy,
Pandas, Scikit-Learn and many others. Dask allows them to scale either on your laptop, with
multi-core and larger-than-memory parallelism, or on large distributed clusters in the cloud or
elsewhere, all while providing a consistent user experience that stays true to the existing Python
community of projects. Libraries like numpy, pandas and scikit-learn are popular today because
they combine a high-level, usable and flexible API with high-performance implementations.

Here is an example of a 3D array filled with ones, with shape (100, 1000, 1000).

import numpy as np

images = np.ones((100, 1000, 1000))


images

array([[[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]],

       [[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]],

       [[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]],

       ...,

       [[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]],

       [[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]],

       [[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]]])

However, these libraries were not originally designed to scale beyond a single CPU or to data
that doesn't fit in memory. Let's change the first dimension of the 3D array to 10000.

import numpy as np

images = np.ones((10000, 1000, 1000))


images

This often results in memory errors, or in switching libraries, when you run into larger data sets.
This is what Dask can help fix. If you replace the numpy library with dask.array, which uses
numpy under the hood, then everything works well.

import dask.array as da

images = da.ones((10000, 1000, 1000))


images

Dask can take numpy and scale it out to multi-core machines or large distributed clusters. By
integrating with existing libraries, Dask enables developers and data scientists to easily transition
from traditional single-machine workflows to parallel and distributed computing, without learning
a new framework or writing much code. This can be done anywhere we write Python, including
other libraries, automated scripts or Jupyter Notebooks. By default, Dask is lazy: to materialize
the data held in a variable, you have to call .compute() on it. For more
information about dask, visit the dask home page (https://dask.org/).
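
As a minimal sketch of this lazy behaviour (the shape is the one used above; the mean reduction is chosen only for illustration):

import dask.array as da

# Building the array is lazy: the ~80 GB of ones are never materialized;
# dask only records a task graph and a chunk layout.
images = da.ones((10000, 1000, 1000))

# Reductions are lazy too: `mean` returns a dask scalar, not a number.
m = images.mean()

# .compute() triggers the actual, chunk-by-chunk parallel computation.
print(m.compute())  # 1.0, since every element is 1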

1.2 How to install Dask?


Installing Dask is easy: go to https://docs.dask.org/en/latest/install.html and follow the
instructions.
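
For reference, a typical installation, assuming a standard pip or conda environment, looks like this (dask-ml, used throughout this tutorial, is a separate package):

# with pip, including the commonly used optional dependencies
python -m pip install "dask[complete]"

# or with conda
conda install dask

# the machine learning extensions used below
python -m pip install dask-ml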

In this tutorial, using the example of identifying ”spam”, we will show how to compute a logistic
regression using Dask packages.

1.3 Spam data


We study the ”spam” data (https://search.r-project.org/CRAN/refmans/kernlab/html/spam.html).
The task is to identify fraudulent electronic messages based on their characteristics (occurrence
of terms, length of documents, proportion of upper-case letters, etc.). We have 4601 observations
and 57 predictor variables, all quantitative. The target variable ”type” is binary {spam,
nonspam}.

We load the spam.csv file and display its dimensions.

# load pandas packages


import pandas as pd

print(pd.__version__)
1.3.5

We are using pandas version 1.3.5.

# Data loading
df = pd.read_csv('spam.csv')

# Data shape
print(df.shape)

(4601, 58)

We get a data frame with 4601 observations and 58 variables (the 57 predictors plus the target).

# info
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 make 4601 non-null float64
1 address 4601 non-null float64
2 all 4601 non-null float64
3 num3d 4601 non-null float64
4 our 4601 non-null float64
5 over 4601 non-null float64
6 remove 4601 non-null float64
7 internet 4601 non-null float64
8 order 4601 non-null float64
9 mail 4601 non-null float64
10 receive 4601 non-null float64
11 will 4601 non-null float64
12 people 4601 non-null float64
13 report 4601 non-null float64
14 addresses 4601 non-null float64
15 free 4601 non-null float64
16 business 4601 non-null float64
17 email 4601 non-null float64
18 you 4601 non-null float64
19 credit 4601 non-null float64
20 your 4601 non-null float64
21 font 4601 non-null float64
22 num000 4601 non-null float64
23 money 4601 non-null float64
24 hp 4601 non-null float64
25 hpl 4601 non-null float64
26 george 4601 non-null float64
27 num650 4601 non-null float64
28 lab 4601 non-null float64
29 labs 4601 non-null float64
30 telnet 4601 non-null float64
31 num857 4601 non-null float64
32 data 4601 non-null float64
33 num415 4601 non-null float64
34 num85 4601 non-null float64
35 technology 4601 non-null float64
36 num1999 4601 non-null float64
37 parts 4601 non-null float64
38 pm 4601 non-null float64
39 direct 4601 non-null float64
40 cs 4601 non-null float64
41 meeting 4601 non-null float64
42 original 4601 non-null float64
43 project 4601 non-null float64
44 re 4601 non-null float64
45 edu 4601 non-null float64
46 table 4601 non-null float64
47 conference 4601 non-null float64
48 charSemicolon 4601 non-null float64
49 charRoundbracket 4601 non-null float64
50 charSquarebracket 4601 non-null float64
51 charExclamation 4601 non-null float64
52 charDollar 4601 non-null float64
53 charHash 4601 non-null float64
54 capitalAve 4601 non-null float64
55 capitalLong 4601 non-null int64
56 capitalTotal 4601 non-null int64
57 type 4601 non-null object
dtypes: float64(55), int64(2), object(1)
memory usage: 2.0+ MB
None

The last variable, type, represents the target variable. All predictor variables are quantitative.

# description
print(df['type'].value_counts())

nonspam    2788
spam       1813
Name: type, dtype: int64

The data set contains 2788 e-mails classified as ”nonspam” and 1813 classified as ”spam”. We
compute the relative frequency by type of e-mail.

# relative frequencies
print(df['type'].value_counts(normalize=True))

nonspam 0.605955
spam 0.394045
Name: type, dtype: float64

39.4% of the e-mails are spam (type = spam) and 60.6% aren't spam (type = nonspam). In the worst
case, if we classify all messages as type = nonspam (null model or default classifier), our success
rate (or correct classification rate) will be about 60%. We should do better with the different
methods that we will study.

Before splitting the data, we first encode the target variable so that it takes the values 0/1
instead of nonspam/spam. Dask has a LabelEncoder to encode labels with values between 0 and
n_classes - 1.

# Labelencoder
from dask_ml.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['type'] = encoder.fit_transform(df['type'])

1.3.1 Data splitting
Like scikit-learn, dask has a train_test_split function for splitting arrays (or data frames) into
random train and test matrices (or data frames). The main difference between the two is that dask
doesn't have the stratify parameter that sklearn offers (a stratified alternative is sketched at the
end of this subsection). We keep 0.7 (70%) of the individuals for learning.

# Data splitting
from dask_ml.model_selection import train_test_split
DTrain, DTest = train_test_split(df, train_size = .7, random_state=2021)

# Train and test shape


print(f"""Shape :
- Train : {DTrain.shape}
- Test : {DTest.shape}""")

Shape :
- Train : (3220, 58)
- Test : (1381, 58)

In both cases, we check the distribution of the classes.

# In train set
print(DTrain['type'].value_counts(normalize=True))

0 0.612112
1 0.387888
Name: type, dtype: float64

# In test set
print(DTest['type'].value_counts(normalize=True))

0 0.591600
1 0.408400
Name: type, dtype: float64

There is a slight, non-significant difference in the class distribution between the training
sample and the test sample.
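
If preserving the class proportions exactly matters, one workaround, sketched here under the assumption that the data frame fits in memory as a pandas object, is to use scikit-learn's stratified split instead:

# a sketch: stratified 70/30 split via scikit-learn
from sklearn.model_selection import train_test_split as sk_split

DTrain_s, DTest_s = sk_split(df, train_size=0.7,
                             stratify=df['type'],  # keep class proportions
                             random_state=2021)

# both samples now share (almost) identical class frequencies
print(DTrain_s['type'].value_counts(normalize=True))
print(DTest_s['type'].value_counts(normalize=True))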

2 Logistic regression
Regression methods have become an integral component of any data analysis concerned with
describing the relationship between a response variable and one or more explanatory variables.
It is often the case that the outcome variable is discrete, taking on two or more possible values:
this is classification.

2.1 Classification
Classification-based tasks are a sub-field of supervised machine learning, where the key
objective is to predict output labels or responses that are categorical in nature for input data,
based on what the model has learned in the training phase. The output labels, also known as
classes or class labels, are categorical in nature, meaning they are unordered and discrete
values. Thus, each output response belongs to a specific discrete class or category.

Over the last decade, the logistic regression model has become, in many fields, the standard
method of analysis for this situation.

2.2 About logistic regression


Logistic regression is one of the most used statistical procedures in research. It is a component of
nearly all, if not all, general-purpose commercial statistical packages, and is regarded as one of the
most important statistical routines in fields such as health-care analysis, medical statistics, credit
rating, ecology, social statistics and econometrics, and other similar areas. Logistic regression has
also been considered by many analysts to be an important procedure in predictive analysis, as well
as in the longer-established Six Sigma movement.

There is a good reason for this popularity. Unlike traditional linear or normal regression,
logistic regression is appropriate for modeling a binary variable. A binary variable has only two
values, 1 and 0. These values may be thought of as success and failure, or as any other type
of positive and non-positive dichotomy.

In this paper, we will show how to perform logistic regression using Dask packages combined
with sklearn. As we have seen in the introduction, our data has a predicted variable, type, which
is binary, and 57 predictor variables.

2.3 Logistic regression in dask


We want to conduct a study using logistic regression. In dask's LogisticRegression, the solver
parameter is set by default to admm (Alternating Direction Method of Multipliers), whereas in
sklearn it is lbfgs (Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm). We set the
solver to lbfgs.

# XTrain and yTrain


XTrain = DTrain.drop(['type'], axis=1)
yTrain = DTrain['type']
# import the logistic regression model from dask_ml
from dask_ml.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs') # default solver = 'admm'
model.fit(XTrain.values,yTrain.values)

LogisticRegression(solver='lbfgs')

The fitted model has two attributes:

• .coef_ : an array; the learned values of the model's coefficients.

• .intercept_ : a float; the learned value of the intercept, if one was added to the model.
By default, fit_intercept is set to True.

# coefficients
coef = pd.DataFrame({'coef.': model.coef_})
coef.index = XTrain.columns
print(coef)

coef.
make -0.479880
address -0.150682
all 0.135193
num3d 0.541169
our 0.659910
over 0.596929
remove 2.251704
internet 0.625953
order 0.336239
mail 0.094922
receive -0.217522
will -0.200684
people -0.003069
report 0.071286
addresses 2.528279
free 1.056557
business 0.933110
email 0.089965
you 0.064004
credit 0.611418
your 0.179545
font 0.183953
num000 2.176546
money 0.815767
hp -1.510591
hpl -1.755439
george -1.098980
num650 0.363375
lab -1.765462
labs -0.189831
telnet -1.436668
num857 -1.367995
data -0.529982
num415 0.042820
num85 -1.568667
technology 0.650104
num1999 0.042632
parts 0.521060
pm -0.930480
direct -0.649896
cs -3.732531
meeting -1.867262
original -1.280181
project -1.049496
re -0.791601
edu -1.063858
table -2.094519
conference -2.914653
charSemicolon -1.203818
charRoundbracket -0.152945
charSquarebracket -0.512236
charExclamation 0.270677
charDollar 5.325154
charHash 1.786209
capitalAve 0.030869
capitalLong 0.008978
capitalTotal 0.000340

We get the intercept.

# intercept
print(f"""Intercept :
{model.intercept_}""")

Intercept :
-1.5461005587075771

2.4 Prediction on the test sample


To assess the quality of our model, we apply it to the test sample. Dask’s predict function returns
a vector whose elements are Boolean.

# XTest and yTest data


XTest = DTest.drop(['type'], axis=1)
yTest = DTest['type']  # type was already encoded as 0/1 above
# Prediction
ypred = model.predict(XTest.values)
# 60 first observations
print(ypred[:60])

[False False True False False True True False False True False True
True False False False True True True True False False True False
True False False False True False True False False False False False
False False True True True True False False True False False False
True False False False True False False False False False False True]

We encode it to get 1 if True and 0 if False, using the dask.array module.

# dask array
import dask.array as da

yPred = da.where(ypred==True,1,0)
print(yPred.compute()[:60])

[0 0 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 1 1 1 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0
0 1 1 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1]

# distribution of predictions
print(pd.DataFrame({"prediction": yPred.compute()}).value_counts())

prediction
0 855
1 526
dtype: int64

There are 526 positive predictions, i.e. 526 e-mails from the test sample are classified as spam.

2.5 Confusion matrix and assessment indicators


An intuitively appealing way to summarize the results of a fitted logistic regression model is
via a classification table, or confusion matrix. This matrix is the result of cross-classifying the
outcome variable with a dichotomous variable whose values are derived from the estimated logistic
probabilities.

# Confusion matrix
from sklearn.metrics import confusion_matrix
conf_mat = pd.DataFrame(confusion_matrix(yTest, yPred),
index = ["nonspam", "spam"],
columns = ["nonspam", "spam"])
print(conf_mat)

nonspam spam
nonspam 781 36
spam 74 490

To plot the result, we use code available online (the cf_matrix helper imported below).

# plot confusion matrix


import seaborn as sns
from cf_matrix import make_confusion_matrix
sns.set_context('talk')
labels = ['True Neg','False Pos','False Neg','True Pos']
categories = ["nonspam", "spam"]
make_confusion_matrix(pd.crosstab(yTest, yPred).values,
group_names=labels,cbar=False,
categories=categories, cmap='Blues',
title='Logistic Confusion Matrix')

Figure 1: Confusion matrix

The predicted results in the diagram above can be read in the following manner, given that 1
represents a spam e-mail.

• True Positive (TP): True positive represents the value of correct predictions of positives
out of actual positive cases. Out of 564 actual positives, 490 are correctly predicted positive.
Thus, the value of True Positive is 490.

• False Positive (FP): False positive represents the value of incorrect positive predictions.
This value represents the number of nonspam e-mails (out of 817) which get falsely predicted as
spam. Out of 817 actual nonspam, 36 are falsely predicted as spam. Thus, the value of False
Positive is 36.

• True Negative (TN): True negative represents the value of correct predictions of nonspam
out of actual nonspam cases. Out of 817 actual nonspam, 781 are correctly predicted nonspam.
Thus, the value of True Negative is 781.

• False Negative (FN): False negative represents the value of incorrect nonspam predictions.
This value represents the number of spam e-mails (out of 564) which get falsely predicted as non-
spam. Out of 564 actual spam, 74 are falsely predicted as nonspam. Thus, the value of False
Negative is 74.

In terms of classification metrics, dask provides only two: the accuracy score and the log loss.

2.5.1 Accuracy Score


Model accuracy is a machine learning performance metric defined as the ratio of true
positives and true negatives to all positive and negative observations. In other words, accuracy
tells us how often we can expect our machine learning model to correctly predict an outcome
out of the total number of times it made predictions. For example, assume that you tested
your machine learning model on a dataset of 100 records and that it predicted 90 of those
instances correctly. The accuracy, in this case, would be 90/100 = 90%. The accuracy rate is
great, but it doesn't tell us anything about the errors our machine learning model makes on new
data we haven't seen before.

Mathematically, it represents the ratio of the sum of true positives and true negatives out of
all the predictions:

$$\text{Accuracy Score} = \frac{TP + TN}{TP + FN + TN + FP}$$
This score can be obtained by using the accuracy_score method from dask_ml.metrics.

# accuracy
from dask_ml.metrics import accuracy_score

acc = accuracy_score(yTest, yPred)

print(f"""Accuracy
{acc}""")

Accuracy
0.9203475742215785
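
As a quick sanity check, the same value can be recomputed by hand from the confusion matrix shown earlier (TN = 781, FP = 36, FN = 74, TP = 490 on the 1381 test e-mails):

# accuracy recomputed from the confusion matrix entries
TN, FP, FN, TP = 781, 36, 74, 490
acc_check = (TP + TN) / (TP + TN + FP + FN)
print(acc_check)  # 0.9203475742215785, matching accuracy_score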
2.5.2 Log loss
Log-loss indicates how close the prediction probability is to the corresponding actual/true
value (0 or 1 in the case of binary classification). The more the predicted probability diverges
from the actual value, the higher the log-loss value.

While training a classification model, we want each observation to be predicted with a
probability as close to the actual value (0 or 1) as possible. Hence, log-loss is a good choice
of loss function for training and optimizing classification models: the farther the prediction
probability is from its true value, the more heavily the prediction is penalized.

$$\text{Logloss}_i = -\left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]$$

where i is the given observation/record, y is the actual/true value, p is the prediction probability,
and ln refers to the natural logarithm (base e) of a number. The overall log loss is the average
over the n observations:

$$\text{Logloss} = \frac{1}{n} \sum_{i=1}^{n} \text{Logloss}_i = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]$$

# log loss
from dask_ml.metrics import log_loss

logloss = log_loss(yTest, yPred)

print(f"""Log loss
{logloss}""")

Log loss
2.751118167232196
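
Note that we pass the hard 0/1 predictions here, which is why the value is so high: each misclassified e-mail is penalized as a fully confident error. A hedged variant, using the predicted probabilities from model.predict_proba (introduced for the ROC and lift curves below), would typically yield a much lower value:

# a sketch: log loss on predicted probabilities rather than hard labels
proba = model.predict_proba(XTest.values)  # per-class probabilities
logloss_proba = log_loss(yTest, proba[:, 1])
print(logloss_proba)  # expected to be well below the 2.75 above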
2.5.3 Precision Score
The model precision score represents the model's ability to correctly predict the positives out of
all the positive predictions it made. The precision score is a useful measure of the success of
prediction when the classes are very imbalanced. Mathematically, it represents the ratio of true
positives to the sum of true positives and false positives:

$$\text{Precision Score} = \frac{TP}{FP + TP}$$

2.5.4 Recall Score

The model recall score represents the model's ability to correctly predict the positives out of
actual positives. This is unlike precision, which measures how many of the positive predictions
made by the model are actually positive. For example, if your machine learning model is trying
to identify positive reviews, the recall score would be the percentage of those positive reviews
that the model correctly predicted as positive. In other words, it measures how good our machine
learning model is at identifying all actual positives out of all positives that exist within a
dataset. The higher the recall score, the better the model is at identifying the actual positives.
Recall score is a useful measure of the success of prediction when the classes are very imbalanced.
Mathematically, it represents the ratio of true positives to the sum of true positives and false
negatives:

$$\text{Recall Score} = \frac{TP}{FN + TP}$$

2.5.5 F1-Score
The model F1 score combines the precision and recall scores. The F-score is a machine learning
performance metric that gives equal weight to precision and recall, making it an alternative to
accuracy (it doesn't require us to know the total number of observations). It is often used as a
single value that provides high-level information about the model's output quality. It is a useful
measure in scenarios where one tries to optimize either precision or recall and, as a result, the
model performance suffers on the other. The following aspects illustrate the issues with
optimizing either precision or recall alone (a sketch computing all three scores follows the
formula below):

• Optimizing for recall helps with minimizing the chance of not detecting a malignant cancer.
However, this comes at the cost of predicting malignant cancer in patients who are healthy
(a high number of FP).

• Optimizing for precision helps with correctness when the model says a patient has a malignant
cancer. However, this comes at the cost of missing malignant cancers more frequently (a high
number of FN).

Mathematically, the F1 score is the harmonic mean of the precision and recall scores:

$$\text{F1-Score} = \frac{2 \times \text{Precision Score} \times \text{Recall Score}}{\text{Precision Score} + \text{Recall Score}}$$
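
Since dask_ml.metrics does not provide these three scores, here is a minimal sketch, assuming sklearn is available alongside dask as elsewhere in this tutorial, that computes them on the test predictions:

# precision, recall and F1 on the test sample, via scikit-learn
from sklearn.metrics import precision_score, recall_score, f1_score

print(precision_score(yTest, yPred))  # TP / (TP + FP) = 490 / 526, about 0.93
print(recall_score(yTest, yPred))     # TP / (TP + FN) = 490 / 564, about 0.87
print(f1_score(yTest, yPred))         # harmonic mean, about 0.90

These values match the ”spam” row of the classification report below.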

2.5.6 Classification report
A classification report is a performance evaluation tool in machine learning. It displays the
precision, recall, F1 score and support of your trained classification model, and provides a
better understanding of its overall performance.

# classification report
from sklearn.metrics import classification_report

print(classification_report(yTest, yPred,
target_names = ["nonspam", "spam"]))

              precision    recall  f1-score   support

     nonspam       0.91      0.96      0.93       817
        spam       0.93      0.87      0.90       564

    accuracy                           0.92      1381
   macro avg       0.92      0.91      0.92      1381
weighted avg       0.92      0.92      0.92      1381

2.5.7 Roc Curve


Receiver operating characteristic (ROC) curves are generally used when statisticians wish to
determine the predictive power of the model. They are also used for classification purposes. The
ROC curve plots the model's sensitivity against one minus its specificity.

When using ROC analysis, the analyst should look at both the ROC statistic and the
plot of the sensitivity versus one minus the specificity. A model with no predictive power follows
the diagonal, with a slope of 1. This corresponds to an ROC statistic of 0.5.

• Values from 0.5 to 0.65 have little predictive power.

• Values from 0.65 to 0.80 have moderate predictive value. Many logistic models fit into this
range.

• Values greater than 0.8 and less than 0.9 are generally regarded as having strong predictive
power.

• Values of 0.9 and greater indicate the highest amount of predictive power, but models rarely
achieve values in this range.

The model is a better classifier with greater values of the ROC statistic, or area under the
curve (AUC). Beware of over-fitting with such models. Validating the model with a validation
sample or samples is recommended.

# ROC Curve
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# probability of the positive class, needed for the ROC curve
score = model.predict_proba(XTest.values)

fpr, tpr, thresholds = roc_curve(yTest, score[:,1])
auc = roc_auc_score(yTest, yPred)

plt.plot(fpr, tpr, label = f'AUC={round(auc,2)}')
plt.plot([0,1],[0,1])
plt.xlabel('False positive rate')
plt.ylabel('True Positive rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

Figure 2: Roc curve

2.5.8 Lift Curve


The lift curve, or cumulative gain curve, is used to measure the effectiveness of targeting. We
are still working on the test sample. To build it, in addition to the observed classes, we need the
probability (score) of belonging to the positive class provided by the model [P(type = 1 | X)].
The .predict_proba() function gives us this probability.

# probability of being spam, as computed for the ROC curve above
score = model.predict_proba(XTest.values)

We use dask.dataframe to put the scores in a data frame.

import dask.dataframe as dd

dataframe = pd.DataFrame({'score':score[:,1]})
dask_df = dd.from_pandas(dataframe, npartitions=10)
print(dask_df.compute())

score
0 6.323558e-11
1 6.555002e-03
2 9.881983e-01
3 6.976487e-08
4 3.093341e-02
... ...
1376 2.407358e-01
1377 1.725983e-03
1378 4.323683e-01
1379 9.085397e-01
1380 6.323558e-11

[1381 rows x 1 columns]

# quantile
print(dask_df.quantile(q=[0, .25, .5, .75, 1.0]).compute().to_frame().T)

0.00 0.25 0.50 0.75 1.00
score 8.873611e-26 0.019862 0.275174 0.97258 1.0

We call the scikitplot package to build the lift curve, using the plot_cumulative_gain()
function.

# lift curve
import matplotlib.pyplot as plt
import scikitplot as skplt

skplt.metrics.plot_cumulative_gain(y_true=yTest, y_probas=score)
plt.show()

Figure 3: Lift curve

The curve is close to the theoretical limit (reached when every type = 1 e-mail is assigned a
higher score than every type = 0 e-mail). Our targeting is of excellent quality.

3 Bibliography
1. Andreas C. Müller & Sarah Guido, Introduction to Machine Learning with Python: A
Guide for Data Scientists.

2. Ann A. O'Connell, Logistic Regression Models for Ordinal Response Variables.

3. Christopher M. Bishop, Pattern Recognition and Machine Learning.

4. David G. Kleinbaum & Mitchel Klein, Logistic Regression: A Self-Learning Text, Third
Edition.

5. David W. Hosmer & Stanley Lemeshow, Applied Logistic Regression, Second Edition.

6. Dipanjan Sarkar, Raghav Bali & Tushar Sharma, Practical Machine Learning with Python:
A Problem-Solver's Guide to Building Real-World Intelligent Systems.

7. Ethem Alpaydin, Introduction to Machine Learning, Second Edition.

8. Joseph M. Hilbe, Practical Guide to Logistic Regression.

9. Laurent Rouvière, Régression logistique avec R, Université de Rennes 2, UFR Sciences
Sociales.

10. Ricco Rakotomalala (2017), Pratique de la régression logistique : Régression logistique
binaire et polytomique, version 2.0, Université Lumière Lyon 2.
