Logistic Regression in Python Using Dask

The document discusses using logistic regression with Dask to classify spam emails. It introduces Dask as a tool for parallel computing in Python that can scale libraries like NumPy, Pandas, and Scikit-Learn to larger datasets. The document then loads a spam email dataset with 4601 observations and 57 predictor variables from a CSV. It performs logistic regression classification on the spam data using Dask packages to predict whether an email is spam or not spam.

Uploaded by

Ousmane Ndour

LOGISTIC REGRESSION IN PYTHON USING DASK

Duvérier DJIFACK ZEBAZE

Data Scientist - Quantitative Risk Analyst
Contents

1 Introduction
1.1 Overview of dask
1.2 How to install Dask?
1.3 Spam data
1.3.1 Data splitting

2 Logistic regression
2.1 Classification
2.2 About logistic regression
2.3 Logistic regression in dask
2.4 Prediction on the test sample
2.5 Confusion matrix and assessment indicators
2.5.1 Accuracy Score
2.5.2 Log loss
2.5.3 Precision Score
2.5.4 Recall Score
2.5.5 F1-Score
2.5.6 Classification report
2.5.7 Roc Curve
2.5.8 Lift Curve

3 Bibliography

1 Introduction
1.1 Overview of dask
The main objective of this paper is to give you a brief overview of Dask, a parallel computing
library for Python. Dask parallelizes many libraries in the Python ecosystem, such as NumPy, pandas,
scikit-learn and many others. Dask allows them to scale either on your laptop, with multi-core
and larger-than-memory parallelism, or on large distributed clusters, in the cloud or elsewhere,
all while providing a consistent user experience that stays true to the existing Python community of
projects. Libraries like NumPy, pandas and scikit-learn are popular today because they combine
high-level, usable and flexible APIs with high-performance implementations.

Here is an example of a 3D array filled with ones, with shape (100, 1000, 1000).

import numpy as np

images = np.ones((100, 1000, 1000))

images

array([[[1., 1., 1., ..., 1., 1., 1.],

[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]],

[[1., 1., 1., ..., 1., 1., 1.],

[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]],

[[1., 1., 1., ..., 1., 1., 1.],

[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]],

...,

[[1., 1., 1., ..., 1., 1., 1.],

[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[1., 1., 1., ..., 1., 1., 1.],

1
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]],

[[1., 1., 1., ..., 1., 1., 1.],

[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]],

[[1., 1., 1., ..., 1., 1., 1.],

[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]]])

However, these libraries were not originally designed to scale beyond a single CPU or to data that
doesn’t fit in memory. Let us change the first dimension of the 3D array to 10000.

import numpy as np

images = np.ones((10000, 1000, 1000))

images

With larger data sets, this often results in memory errors or forces you to switch libraries.
This is what Dask can help fix. If you replace the numpy library with dask.array, which uses
NumPy under the hood, then everything works well.
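
A quick back-of-the-envelope calculation shows why the second array is a problem. The sketch
below assumes the default float64 dtype (8 bytes per element); the helper name array_gigabytes
is hypothetical and only serves this illustration.

```python
from math import prod

def array_gigabytes(shape, bytes_per_element=8):
    """Approximate in-memory size of a dense array, in gigabytes (10**9 bytes)."""
    return prod(shape) * bytes_per_element / 1e9

small = array_gigabytes((100, 1000, 1000))    # shape of the first example
large = array_gigabytes((10000, 1000, 1000))  # shape of the second example

print(f"small: {small} GB, large: {large} GB")
```

The first array needs less than 1 GB, while the second needs about 80 GB, which is more memory
than a typical laptop has.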

import dask.array as da

images = da.ones((10000, 1000, 1000))

images

Dask can use NumPy and scale it out to multi-core machines or large distributed clusters. By
integrating with existing libraries, Dask enables developers and data scientists to transition easily
from traditional single-machine workflows to parallel and distributed computing without learning
a new framework or rewriting much code. This can be done anywhere we write Python, including
other libraries, automated scripts or Jupyter Notebooks. By default, Dask is lazy: to materialize
the data held in a variable, you have to call .compute() on it. For more
information about Dask, visit the Dask home page (https://dask.org/).
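
The lazy behaviour described above can be sketched on a small array. The shape and chunk sizes
below are arbitrary choices made for this illustration, not values taken from the tutorial.

```python
import dask.array as da

# Build a lazy 2-D array of ones, split into 16 chunks of 250 x 250.
x = da.ones((1000, 1000), chunks=(250, 250))

# Nothing is computed yet: `total` only describes the work to be done.
total = x.sum()

# .compute() executes the task graph and returns a concrete value.
result = total.compute()
print(result)
```

Until .compute() is called, printing total only shows a description of the array (shape, dtype,
chunking), not its values.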

1.2 How to install Dask?

Dask is easy to install: follow the instructions at https://docs.dask.org/en/latest/install.html.

In this tutorial, using a spam identification example, we show how to fit a logistic
regression using Dask packages.

1.3 Spam data

We study the "spam" data (https://search.r-project.org/CRAN/refmans/kernlab/html/spam.html).
The task is to identify fraudulent electronic messages based on their characteristics (occurrence
of terms, length of documents, proportion of upper-case letters, etc.). We have 4601 observations
and 57 predictor variables, all quantitative. The target variable "type" is binary {spam, nonspam}.

We load the spam.csv file and display its dimensions.

# load the pandas package

import pandas as pd

print(pd.__version__)
1.3.5

We are using pandas version 1.3.5, the latest version at the time of writing.

# Data loading
df = pd.read_csv('spam.csv')

# Data shape
print(df.shape)

(4601, 58)

The data frame has 4601 observations and 58 variables (57 predictors plus the target).

# info
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 make 4601 non-null float64
1 address 4601 non-null float64
2 all 4601 non-null float64
3 num3d 4601 non-null float64
4 our 4601 non-null float64
5 over 4601 non-null float64
6 remove 4601 non-null float64
7 internet 4601 non-null float64
8 order 4601 non-null float64
9 mail 4601 non-null float64
10 receive 4601 non-null float64
11 will 4601 non-null float64
12 people 4601 non-null float64
13 report 4601 non-null float64
14 addresses 4601 non-null float64
15 free 4601 non-null float64
16 business 4601 non-null float64
17 email 4601 non-null float64
18 you 4601 non-null float64
19 credit 4601 non-null float64
20 your 4601 non-null float64
21 font 4601 non-null float64
22 num000 4601 non-null float64
23 money 4601 non-null float64
24 hp 4601 non-null float64
25 hpl 4601 non-null float64
26 george 4601 non-null float64
27 num650 4601 non-null float64
28 lab 4601 non-null float64
29 labs 4601 non-null float64
30 telnet 4601 non-null float64
31 num857 4601 non-null float64
32 data 4601 non-null float64
33 num415 4601 non-null float64
34 num85 4601 non-null float64
35 technology 4601 non-null float64
36 num1999 4601 non-null float64
37 parts 4601 non-null float64
38 pm 4601 non-null float64
39 direct 4601 non-null float64
40 cs 4601 non-null float64
41 meeting 4601 non-null float64
42 original 4601 non-null float64
43 project 4601 non-null float64
44 re 4601 non-null float64
45 edu 4601 non-null float64
46 table 4601 non-null float64
47 conference 4601 non-null float64
48 charSemicolon 4601 non-null float64
49 charRoundbracket 4601 non-null float64
50 charSquarebracket 4601 non-null float64
51 charExclamation 4601 non-null float64
52 charDollar 4601 non-null float64
53 charHash 4601 non-null float64
54 capitalAve 4601 non-null float64
55 capitalLong 4601 non-null int64
56 capitalTotal 4601 non-null int64
57 type 4601 non-null object
dtypes: float64(55), int64(2), object(1)
memory usage: 2.0+ MB
None

The last variable, type, represents the target variable. All predictor variables are quantitative.

# description
print(df['type'].value_counts())

The data set contains 2788 e-mails classified as "nonspam" and 1813 classified as "spam". We
compute the relative frequency of each type of e-mail.

# relative frequencies
print(df['type'].value_counts(normalize=True))

nonspam 0.605955
spam 0.394045
Name: type, dtype: float64

39.4% of e-mails are spam (type=spam) and 60.6% aren’t spam (type=nonspam). In the worst
case, if we classify all messages as type = nonspam (null model or default classifier), our success
rate (or good classification rate) will be 60.6%. We should do better with the different methods
that we will study.
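
The baseline can be checked directly from the class counts reported above. This is a minimal
sketch; the variable names are chosen for the illustration only.

```python
# Class counts reported above for the spam data.
counts = {"nonspam": 2788, "spam": 1813}
total = sum(counts.values())

# The null model always predicts the most frequent class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))
```

Any model worth keeping should achieve an accuracy clearly above this majority-class rate.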

Before splitting the data, we first encode the target variable so that it takes the values 0/1
rather than nonspam/spam. dask_ml provides the LabelEncoder class to encode labels with values
between 0 and n_classes-1.

# Labelencoder
from dask_ml.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['type'] = encoder.fit_transform(df['type'])
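
To make the encoding concrete, here is a plain-Python sketch of what a label encoder does. It
assumes the classes are sorted before being numbered, which is consistent with the results shown
later in this tutorial, where 0 stands for nonspam and 1 for spam. The toy labels are made up.

```python
labels = ["spam", "nonspam", "nonspam", "spam", "nonspam"]  # toy example

# Sort the distinct classes so that the mapping is deterministic.
classes = sorted(set(labels))
class_to_code = {name: code for code, name in enumerate(classes)}

encoded = [class_to_code[name] for name in labels]
decoded = [classes[code] for code in encoded]  # inverse transform

print(class_to_code, encoded)
```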

1.3.1 Data splitting
Like scikit-learn, dask_ml has a train_test_split function for splitting arrays (or data frames)
into random train and test matrices (or data frames). The main difference between the two is that
the dask_ml version doesn’t have the stratify parameter that sklearn has. We keep 0.7 (70%) of
individuals for learning.

# Data splitting
from dask_ml.model_selection import train_test_split
DTrain, DTest = train_test_split(df, train_size = .7, random_state=2021)

# Train and test shape

print(f"""Shape :
- Train : {DTrain.shape}
- Test : {DTest.shape}""")

Shape :
- Train : (3220, 58)
- Test : (1381, 58)

In both cases, we check the distribution of the classes.

# In train set
print(DTrain['type'].value_counts(normalize=True))

0 0.612112
1 0.387888
Name: type, dtype: float64

# In test set
print(DTest['type'].value_counts(normalize=True))

0 0.5916
1 0.4084
Name: type, dtype: float64

There is only a slight difference in the class distribution between the training
sample and the test sample.
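
If identical class proportions in both samples are required, the split can be stratified by hand.
The following is a plain-Python sketch, not part of dask_ml; the function name stratified_split
is hypothetical, and the labels are simulated with the same class counts as the spam data.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_size=0.7, seed=2021):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        cut = round(train_size * len(indices))    # per-class training share
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Simulated labels with the same class counts as the spam data.
labels = [0] * 2788 + [1] * 1813
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(round(train_rate, 4), round(test_rate, 4))
```

The resulting index lists could then be used to select rows of the data frame for each sample.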

2 Logistic regression
Regression methods have become an integral component of any data analysis concerned with
describing the relationship between a response variable and one or more explanatory variables.
It is often the case that the outcome variable is discrete, taking on two or more possible values.
This is a classification problem.

2.1 Classification
Classification tasks are a sub-field of supervised machine learning, where the key
objective is to predict output labels or responses that are categorical in nature for input data,
based on what the model has learned in the training phase. Output labels here are also known as
classes or class labels; they are categorical in nature, meaning they are unordered and discrete
values. Thus, each output response belongs to a specific discrete class or category.

Over the last decade the logistic regression model has become, in many fields, the standard
method of analysis in this situation.

2.2 About logistic regression

Logistic regression is one of the most used statistical procedures in research. It is a component of
nearly all, if not all, general purpose commercial statistical packages, and is regarded as one of the
most important statistical routines in fields such as health-care analysis, medical statistics, credit
rating, ecology, social statistics and econometrics, and other similar areas. Logistic regression has
also been considered by many analysts to be an important procedure in predictive analysis, as well
as in the longer established Six Sigma movement.

There is a good reason for this popularity. Unlike traditional linear or normal regression,
logistic regression is appropriate for modeling a binary variable. A binary variable has only two
values, 1 and 0. These values may be thought of as success and failure, or as any other type
of positive and non-positive dichotomy.

In this paper, we show how to perform logistic regression using Dask packages combined
with sklearn. As we saw in the introduction, our data has a binary target variable, type,
and 57 predictor variables.

2.3 Logistic regression in dask

We want to conduct a study using logistic regression. In dask_ml's LogisticRegression, the
solver parameter defaults to admm (Alternating Direction Method of Multipliers), whereas in sklearn
it defaults to lbfgs (Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm). We set the solver
to lbfgs.

# XTrain and yTrain

XTrain = DTrain.drop(['type'], axis=1)
yTrain = DTrain['type']
# import the logistic regression model from dask_ml
from dask_ml.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs') # the default is solver = 'admm'
model.fit(XTrain.values,yTrain.values)

LogisticRegression(solver='lbfgs')

The fitted model has two attributes:

• .coef_ : an array. The learned values of the model’s coefficients.

• .intercept_ : a float. The learned value of the intercept, if one was added to the model.
By default, fit_intercept is set to True.

# coefficients
coef = pd.DataFrame({'coef.': model.coef_})
coef.index = XTrain.columns
print(coef)

coef.
make -0.479880
address -0.150682
all 0.135193
num3d 0.541169
our 0.659910
over 0.596929
remove 2.251704
internet 0.625953
order 0.336239
mail 0.094922
receive -0.217522
will -0.200684
people -0.003069
report 0.071286
addresses 2.528279
free 1.056557
business 0.933110
email 0.089965
you 0.064004
credit 0.611418
your 0.179545
font 0.183953
num000 2.176546
money 0.815767
hp -1.510591
hpl -1.755439
george -1.098980
num650 0.363375
lab -1.765462
labs -0.189831
telnet -1.436668
num857 -1.367995
data -0.529982
num415 0.042820
num85 -1.568667
technology 0.650104
num1999 0.042632
parts 0.521060
pm -0.930480
direct -0.649896
cs -3.732531
meeting -1.867262
original -1.280181
project -1.049496
re -0.791601
edu -1.063858
table -2.094519
conference -2.914653
charSemicolon -1.203818
charRoundbracket -0.152945
charSquarebracket -0.512236
charExclamation 0.270677
charDollar 5.325154
charHash 1.786209
capitalAve 0.030869
capitalLong 0.008978
capitalTotal 0.000340

We get the intercept.

# intercept
print(f"""Intercept :
{model.intercept_}""")

Intercept :
-1.5461005587075771
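
To see how the coefficients and the intercept turn into a probability, the sketch below applies
the logistic function to a linear score. It assumes the usual logistic regression form
P(type = 1 | x) = 1 / (1 + exp(-(intercept + x · coef))). Only three of the coefficients printed
above are used, all other features are held at zero, and the feature values are made up.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

intercept = -1.5461005587075771  # value printed above

# A few of the fitted coefficients printed above (not the full model).
coef = {"remove": 2.251704, "free": 1.056557, "hp": -1.510591}

def spam_probability(features):
    """P(type = 1 | x) using only the coefficients listed in `coef`."""
    z = intercept + sum(coef[name] * value for name, value in features.items())
    return sigmoid(z)

p_empty = spam_probability({})                            # all features at zero
p_promo = spam_probability({"remove": 1.0, "free": 1.0})  # hypothetical values
p_work = spam_probability({"hp": 1.0})                    # hypothetical values
print(round(p_empty, 3), round(p_promo, 3), round(p_work, 3))
```

Positive coefficients (remove, free) push the probability of spam up, while negative ones (hp)
push it down.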

2.4 Prediction on the test sample

To assess the quality of our model, we apply it to the test sample. Dask’s predict function returns
a vector whose elements are Boolean.

# XTest and yTest data

XTest = DTest.drop(['type'], axis=1)
# 'type' was already encoded as 0/1 before splitting, so the encoder is not refitted
yTest = DTest['type'].values
# Prediction
ypred = model.predict(XTest.values)
# first 60 observations
print(ypred[:60])

[False False True False False True True False False True False True
True False False False True True True True False False True False
True False False False True False True False False False False False
False False True True True True False False True False False False
True False False False True False False False False False False True]

We encode it to get 1 if True and 0 if False, using the where function from dask.array.

# dask array
import dask.array as da

yPred = da.where(ypred==True,1,0)
print(yPred.compute()[:60])

[0 0 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 1 1 1 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0
0 1 1 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1]

# distribution of predictions (counts)
print(pd.DataFrame({"prediction":yPred}).value_counts())

prediction
0 855
1 526
dtype: int64

There are 526 positive predictions i.e. 526 emails from the test sample are designated spam.

2.5 Confusion matrix and assessment indicators

An intuitively appealing way to summarize the results of a fitted logistic regression model is
via a classification table or confusion matrix. This matrix is the result of cross-classifying the
outcome variable, with a dichotomous variable whose values are derived from the estimated logistic
probabilities.

# Confusion matrix
from sklearn.metrics import confusion_matrix
conf_mat = pd.DataFrame(confusion_matrix(yTest, yPred),
index = ["nonspam", "spam"],
columns = ["nonspam", "spam"])
print(conf_mat)

nonspam spam
nonspam 781 36
spam 74 490

To plot the result, we use make_confusion_matrix from cf_matrix, a helper module available online.

# plot confusion matrix

import seaborn as sns
from cf_matrix import make_confusion_matrix
sns.set_context('talk')
labels = ['True Neg','False Pos','False Neg','True Pos']
categories = ["nonspam", "spam"]
make_confusion_matrix(pd.crosstab(yTest, yPred).values,
group_names=labels,cbar=False,
categories=categories, cmap='Blues',
title='Logistic Confusion Matrix')

Figure 1: Confusion matrix

The results in the figure above can be read as follows, where 1 represents a spam e-mail.

• True Positive (TP): the number of actual spam e-mails correctly predicted as spam.
Out of 564 actual spam, 490 are correctly predicted as spam, so TP = 490.

• False Positive (FP): the number of nonspam e-mails incorrectly predicted as spam.
Out of 817 actual nonspam, 36 are falsely predicted as spam, so FP = 36.

• True Negative (TN): the number of actual nonspam e-mails correctly predicted as nonspam.
Out of 817 actual nonspam, 781 are correctly predicted as nonspam, so TN = 781.

• False Negative (FN): the number of spam e-mails incorrectly predicted as nonspam.
Out of 564 actual spam, 74 are falsely predicted as nonspam, so FN = 74.

In terms of classification metrics, dask_ml provides only two: accuracy score and log loss.

2.5.1 Accuracy Score

Model accuracy is a performance metric defined as the ratio of correct predictions (true
positives and true negatives) to all observations. In other words, accuracy
tells us how often we can expect our model to predict an outcome correctly
out of the total number of predictions it makes. For example, suppose you
test your model on a dataset of 100 records and it predicts 90 of them correctly.
The accuracy is then 90/100 = 90%. A high accuracy rate is encouraging, but on its own it
tells us nothing about the errors the model makes on new data we have not seen before.

Mathematically, it represents the ratio of the sum of true positive and true negatives out of
all the predictions.
\[
\text{Accuracy Score} = \frac{TP + TN}{TP + FN + TN + FP}
\]
The same score can be obtained by using the accuracy_score function from dask_ml.metrics.
# accuracy
from dask_ml.metrics import accuracy_score

acc = accuracy_score(yTest, yPred)

print(f"""Accuracy
{acc}""")
Accuracy
0.9203475742215785
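
As a check, the same value can be recomputed by hand from the four counts of the confusion
matrix shown earlier. This is only a sketch of the formula; the variable names are illustrative.

```python
# Counts read from the confusion matrix (rows: actual, columns: predicted).
tn, fp = 781, 36
fn, tp = 74, 490

accuracy = (tp + tn) / (tp + fn + tn + fp)
print(accuracy)
```
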
2.5.2 Log loss
Log loss indicates how close the predicted probability is to the corresponding actual/true
value (0 or 1 in the case of binary classification). The more the predicted probability diverges
from the actual value, the higher the log loss.

While training a classification model, we want each observation to be predicted with a
probability as close to the actual value (0 or 1) as possible. Hence, log loss is a good choice
of loss function for training and optimizing classification models: the farther the predicted
probability is from the true value, the more the prediction is penalized.

\[
\text{Logloss}_i = -\left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]
\]

where i is the given observation/record, y_i is the actual/true value, p_i is the prediction
probability, and ln refers to the natural logarithm (logarithm to base e) of a number.

\[
\text{Logloss} = \frac{1}{n} \sum_{i=1}^{n} \text{Logloss}_i
= -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]
\]

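# log loss
from dask_ml.metrics import log_loss

# note: yPred holds hard 0/1 predictions, not probabilities, so the value
# below mostly reflects the misclassification rate
logloss = log_loss(yTest, yPred)

print(f"""Log loss
{logloss}""")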
Log loss
2.751118167232196
2.5.3 Precision Score
Model precision score represents the model’s ability to correctly predict the positives out of all the
positive predictions it made. The precision score is a useful measure of the success of prediction
when the classes are very imbalanced. Mathematically, it represents the ratio of true positive to
the sum of true positive and false positive.
\[
\text{Precision Score} = \frac{TP}{FP + TP}
\]

2.5.4 Recall Score

Model recall score represents the model’s ability to correctly predict the positives out of actual
positives. This is unlike precision, which measures how many of the positive predictions made by
the model are actually positive. For example, if your model is trying to identify positive
reviews, the recall score is the percentage of the actual positive reviews that the model
correctly predicts as positive. In other words, it measures how good the model is at finding all
the actual positives that exist within a dataset. The higher the recall score, the better the
model is at identifying positive examples. Recall is a useful measure of prediction success when
the classes are very imbalanced. Mathematically, it represents the
ratio of true positives to the sum of true positives and false negatives.
\[
\text{Recall Score} = \frac{TP}{FN + TP}
\]

2.5.5 F1-Score
Model F1 score represents the model score as a function of precision and recall. The F-score is
a performance metric that gives equal weight to precision and recall, making it an alternative
to accuracy (it doesn’t require us to know the total number of observations). It is often used
as a single value that provides high-level information about the quality of the model’s output.
It is a useful measure in scenarios where optimizing precision or recall alone would make the
other suffer. The following points illustrate the issues with optimizing only one of the two:

• Optimizing for recall helps minimize the chance of not detecting a malignant cancer.
However, this comes at the cost of predicting malignant cancer in patients who are
healthy (a high number of FP).

• Optimizing for precision helps ensure that a patient predicted to have a malignant cancer
really has one. However, this comes at the cost of missing malignant cancer more frequently
(a high number of FN).

Mathematically, it can be represented as the harmonic mean of the precision and recall scores.

\[
\text{F1-Score} = \frac{2 \times \text{Precision Score} \times \text{Recall Score}}{\text{Precision Score} + \text{Recall Score}}
\]
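
The three formulas can be applied by hand to the counts of the confusion matrix obtained earlier
for the spam class. This is a minimal sketch; the variable names are illustrative.

```python
# Counts read from the confusion matrix (rows: actual, columns: predicted).
tn, fp = 781, 36
fn, tp = 74, 490

precision = tp / (fp + tp)
recall = tp / (fn + tp)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Rounded to two decimals, these values can be compared with the "spam" row of the classification
report produced by sklearn.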

2.5.6 Classification report
A classification report summarizes the performance of a trained classification model. It
displays the precision, recall, F1 score and support for each class, and so gives a better
understanding of the overall performance of our trained model.

# classification report
from sklearn.metrics import classification_report

print(classification_report(yTest, yPred,
target_names = ["nonspam", "spam"]))

              precision    recall  f1-score   support

     nonspam       0.91      0.96      0.93       817
        spam       0.93      0.87      0.90       564

    accuracy                           0.92      1381
   macro avg       0.92      0.91      0.92      1381
weighted avg       0.92      0.92      0.92      1381

2.5.7 Roc Curve

Receiver operating characteristic (ROC) curves are generally used when statisticians wish to
determine the predictive power of a model. They are also used for classification purposes. The
ROC curve plots the model's sensitivity against one minus the specificity as the classification
threshold varies.

When using ROC analysis, the analyst should look at both the ROC statistic and the
plot of the sensitivity versus one minus the specificity. A model with no predictive power
follows the diagonal, with a slope of 1. This corresponds to an ROC statistic of 0.5.

• Values from 0.5 to 0.65 have little predictive power.

• Values from 0.65 to 0.80 have moderate predictive value. Many logistic models fit into this
range.

• Values greater than 0.8 and less than 0.9 are generally regarded as having strong predictive
power.

• Values of 0.9 and greater indicate the highest amount of predictive power, but models rarely
achieve values in this range.

The model is a better classifier with greater values of the ROC statistic, or area under the
curve (AUC). Beware of over-fitting with such models. Validating the model with a validation
sample or samples is recommended.

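# ROC Curve
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# predicted probabilities; column 1 is P(type = 1 | X)
score = model.predict_proba(XTest.values)

fpr, tpr, thresholds = roc_curve(yTest, score[:,1])
# compute the AUC from the scores, not from the hard 0/1 predictions
auc = roc_auc_score(yTest, score[:,1])

plt.plot(fpr, tpr, label = f'AUC={round(auc,2)}')
plt.plot([0,1],[0,1])  # diagonal: model with no predictive power
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend()
plt.show()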

Figure 2: Roc curve
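
The area under the ROC curve has a simple interpretation: it is the probability that a randomly
chosen positive observation receives a higher score than a randomly chosen negative one. The
sketch below computes it that way on toy data; the function name auc_by_pairs is hypothetical,
and the scores are not taken from the spam model.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count for one half. Quadratic in the sample size, so only
    suitable for small illustrations.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # toy scores
auc_toy = auc_by_pairs(y_true, scores)
print(auc_toy)
```

Here three of the four (positive, negative) pairs are ranked correctly, so the AUC is 0.75.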

2.5.8 Lift Curve

The lift curve, or cumulative gain curve, is used to measure the effectiveness of targeting. We
are still working on the test sample. To build it, in addition to the observed classes, we need the
probability (score) of belonging to the positive class provided by the model [P(type = 1 | X)]. The
.predict_proba() function gives us this probability.

# predicted probability by type

import numpy as np

score = model.predict_proba(XTest.values)

We use dask.dataframe to store the score as a data frame.

import dask.dataframe as dd

dataframe = pd.DataFrame({'score':score[:,1]})
dask_df = dd.from_pandas(dataframe, npartitions=10)
print(dask_df.compute())

score
0 6.323558e-11
1 6.555002e-03
2 9.881983e-01
3 6.976487e-08
4 3.093341e-02
... ...
1376 2.407358e-01
1377 1.725983e-03
1378 4.323683e-01
1379 9.085397e-01
1380 6.323558e-11

[1381 rows x 1 columns]

# quantile
print(dask_df.quantile(q=[0, .25, .5, .75, 1.0]).compute().to_frame().T)

0.00 0.25 0.50 0.75 1.00

score 8.873611e-26 0.019862 0.275174 0.97258 1.0

We call the scikitplot package to build the lift curve, using its plot_cumulative_gain()
function.

# lift curve
import matplotlib.pyplot as plt
import scikitplot as skplt

skplt.metrics.plot_cumulative_gain(y_true=yTest, y_probas=score)
plt.show()

Figure 3: Lift curve

The curve is close to the theoretical limit (reached when every type = 1 observation is assigned
a higher score than every type = 0 observation). Our targeting is of excellent quality.
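
The construction of a cumulative gain curve can be sketched in plain Python: sort the
observations by decreasing score, then track the share of positives captured as more of the
sample is targeted. The function name cumulative_gain and the toy data are assumptions made
for this illustration.

```python
def cumulative_gain(y_true, scores):
    """Return (fraction of sample targeted, fraction of positives captured)."""
    # Indices of the observations, from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    xs, ys, captured = [0.0], [0.0], 0
    for rank, i in enumerate(order, start=1):
        captured += y_true[i]
        xs.append(rank / len(order))
        ys.append(captured / n_pos)
    return xs, ys

y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.10, 0.80, 0.60, 0.20, 0.55, 0.05, 0.30, 0.15, 0.40]
xs, ys = cumulative_gain(y_true, scores)

gain_at_30 = ys[3]  # positives captured in the top 30% of scores
gain_at_40 = ys[4]  # positives captured in the top 40% of scores
print(round(gain_at_30, 3), gain_at_40)
```

In this toy example, targeting the 30% highest scores captures two of the three positives, and
targeting 40% captures all of them. A random targeting would capture only 30% and 40%.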

3 Bibliography
1. Andreas C. Müller & Sarah Guido, Introduction to Machine Learning with Python: A
Guide for Data Scientists.

2. Ann A. O’Connell, Logistic Regression Models for Ordinal Response Variables.

3. Christopher M. Bishop, Pattern Recognition and Machine Learning.

4. David G. Kleinbaum & Mitchel Klein, Logistic Regression: A Self-Learning Text, Third
Edition.

5. David W. Hosmer & Stanley Lemeshow, Applied Logistic Regression, Second Edition.

6. Dipanjan Sarkar, Raghav Bali & Tushar Sharma, Practical Machine Learning with Python:
A Problem-Solver’s Guide to Building Real-World Intelligent Systems.

7. Ethem Alpaydin, Introduction to Machine Learning, Second Edition.

8. Joseph M. Hilbe, Practical Guide to Logistic Regression.

9. Laurent Rouvière, Régression logistique avec R, Université de Rennes 2, UFR Sciences
Sociales.

10. Ricco Rakotomalala (2017), Pratique de la régression logistique : Régression logistique
binaire et polytomique, version 2.0, Université Lumière Lyon 2.

R Statistics For Comparing Means Interior
100% (1)
R Statistics For Comparing Means Interior
205 pages
Complete SQL Notes
81% (53)
Complete SQL Notes
18 pages
BD Com Switch-Basic Configuration Commands
17% (6)
BD Com Switch-Basic Configuration Commands
92 pages
Headfirst Gdal
No ratings yet
Headfirst Gdal
146 pages
ae9P57Rv6Fm9p iW2kV5MV5gc3iJ4zE
No ratings yet
ae9P57Rv6Fm9p iW2kV5MV5gc3iJ4zE
30 pages
?UTF-8?Q?CSE 20-98 Lindberg So CC 88derberg - PDF?
No ratings yet
?UTF-8?Q?CSE 20-98 Lindberg So CC 88derberg - PDF?
100 pages
Durexforth-V2 0 0
No ratings yet
Durexforth-V2 0 0
32 pages
Pandas Guide
No ratings yet
Pandas Guide
64 pages
M202101 R Tools For JDemetra
No ratings yet
M202101 R Tools For JDemetra
28 pages
2024 7487 Moesm1 Esm
No ratings yet
2024 7487 Moesm1 Esm
44 pages
Stamps/Mti Manual: Version 3.3B1
No ratings yet
Stamps/Mti Manual: Version 3.3B1
36 pages
Xdice Documentation: Release 1.0.0
No ratings yet
Xdice Documentation: Release 1.0.0
17 pages
Learneverythingai 1661068200
No ratings yet
Learneverythingai 1661068200
66 pages
BigML WhizzML Tutorials
No ratings yet
BigML WhizzML Tutorials
45 pages
Pandasguide
No ratings yet
Pandasguide
65 pages
Stamps/Mti Manual
No ratings yet
Stamps/Mti Manual
35 pages
AutoMapper Documentation
100% (1)