Logistic Regression in Python Using Dask
1 Introduction
1.1 Overview of Dask
1.2 How to install Dask?
1.3 Spam data
1.3.1 Data splitting
2 Logistic regression
2.1 Classification
2.2 About logistic regression
2.3 Logistic regression in Dask
2.4 Prediction on the test sample
2.5 Confusion matrix and assessment indicators
2.5.1 Accuracy Score
2.5.2 Log loss
2.5.3 Precision Score
2.5.4 Recall Score
2.5.5 F1-Score
2.5.6 Classification report
2.5.7 ROC Curve
2.5.8 Lift Curve
3 Bibliography
1 Introduction
1.1 Overview of dask
The main objective of this paper is to give you a brief overview of Dask, a parallel computing
library for Python. Dask parallelizes many libraries in the Python ecosystem, such as NumPy, pandas,
scikit-learn and many others. It allows them to scale either on your laptop, with multi-core
and larger-than-memory parallelism, or on large distributed clusters in the cloud or elsewhere, all
while providing a consistent user experience that stays true to the existing Python community of
projects. Libraries like NumPy, pandas and scikit-learn are popular today because they combine
high-level, usable and flexible APIs with high-performance implementations.
Here is an example of a 3D array filled with ones, with shape (100, 100, 1000).
import numpy as np
...,
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]],
However, these libraries were not originally designed to scale beyond a single CPU or to data that
doesn't fit in memory. Let us change the first dimension of the 3D array to 10000.
import numpy as np
With larger data sets, this often results in memory errors or forces you to switch libraries.
This is what Dask can help fix. If you replace the numpy library with dask.array, which uses
NumPy under the hood, everything works well.
import dask.array as da
Dask can use NumPy and scale it out to multi-core machines or large distributed clusters. By
integrating with existing libraries, Dask enables developers and data scientists to easily transition
from traditional single-machine workflows to parallel and distributed computing without learning
a new framework or rewriting much code. This can be done anywhere we write Python, including
other libraries, automated scripts or Jupyter Notebooks. By default, Dask is lazy: to see the
data held in a variable, you have to call .compute() on it. For more
information about Dask, visit the Dask home page (https://dask.org/).
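As a minimal sketch of this lazy behaviour, the snippet below builds a Dask array of ones and only evaluates it when .compute() is called. The shape and the chunks argument are illustrative assumptions (smaller than the arrays discussed above so that it runs quickly), not values from the original example.

```python
import dask.array as da

# Illustrative shape and chunking (assumptions): 10 chunks of 100 x 100 x 100
x = da.ones((1000, 100, 100), chunks=(100, 100, 100))

# Nothing is computed yet: `total` is a lazy Dask array describing the work
total = x.sum()

# .compute() triggers the actual, chunk-by-chunk evaluation
result = total.compute()
print(x.numblocks, result)
```

Because each chunk is processed separately, the full array never has to be held in memory at once, which is what lets the same code scale to larger shapes.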
In this tutorial, using a spam-detection example, we will show how to compute a logistic
regression using Dask packages.
import pandas as pd
print(pd.__version__)
1.3.5
# Data loading
df = pd.read_csv('spam.csv')
# Data shape
print(df.shape)
(4601, 58)
# info
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 make 4601 non-null float64
1 address 4601 non-null float64
2 all 4601 non-null float64
3 num3d 4601 non-null float64
4 our 4601 non-null float64
5 over 4601 non-null float64
6 remove 4601 non-null float64
7 internet 4601 non-null float64
8 order 4601 non-null float64
9 mail 4601 non-null float64
10 receive 4601 non-null float64
11 will 4601 non-null float64
12 people 4601 non-null float64
13 report 4601 non-null float64
14 addresses 4601 non-null float64
15 free 4601 non-null float64
16 business 4601 non-null float64
17 email 4601 non-null float64
18 you 4601 non-null float64
19 credit 4601 non-null float64
20 your 4601 non-null float64
21 font 4601 non-null float64
22 num000 4601 non-null float64
23 money 4601 non-null float64
24 hp 4601 non-null float64
25 hpl 4601 non-null float64
26 george 4601 non-null float64
27 num650 4601 non-null float64
28 lab 4601 non-null float64
29 labs 4601 non-null float64
30 telnet 4601 non-null float64
31 num857 4601 non-null float64
32 data 4601 non-null float64
33 num415 4601 non-null float64
34 num85 4601 non-null float64
35 technology 4601 non-null float64
36 num1999 4601 non-null float64
37 parts 4601 non-null float64
38 pm 4601 non-null float64
39 direct 4601 non-null float64
40 cs 4601 non-null float64
41 meeting 4601 non-null float64
42 original 4601 non-null float64
43 project 4601 non-null float64
44 re 4601 non-null float64
45 edu 4601 non-null float64
46 table 4601 non-null float64
47 conference 4601 non-null float64
48 charSemicolon 4601 non-null float64
49 charRoundbracket 4601 non-null float64
50 charSquarebracket 4601 non-null float64
51 charExclamation 4601 non-null float64
52 charDollar 4601 non-null float64
53 charHash 4601 non-null float64
54 capitalAve 4601 non-null float64
55 capitalLong 4601 non-null int64
56 capitalTotal 4601 non-null int64
57 type 4601 non-null object
dtypes: float64(55), int64(2), object(1)
memory usage: 2.0+ MB
None
The last variable, type, represents the target variable. All predictor variables are quantitative.
# description
print(df['type'].value_counts())
The data set contains 2788 e-mails classified as "nonspam" and 1813 classified as "spam". We
compute the relative frequency of each type of e-mail.
# relative frequencies
print(df['type'].value_counts(normalize=True))
nonspam 0.605955
spam 0.394045
Name: type, dtype: float64
39.4% of e-mails are spam (type=spam) and 60.6% are not (type=nonspam). In the worst
case, if we classify all messages as type = nonspam (null model or default classifier), our success rate
(or good classification rate) will be about 60.6%. We should do better with the different methods that we
will study.
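The null-model success rate can be checked directly from the class counts reported above. The following is a small sketch in plain Python; the variable names are illustrative.

```python
# Class counts reported for the spam data set
counts = {"nonspam": 2788, "spam": 1813}

total = sum(counts.values())

# The null model always predicts the most frequent class
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))
```

Any fitted classifier should be compared against this baseline rather than against zero.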
Before splitting the data, we first encode the target variable so that it takes the values 0/1
instead of nonspam/spam. Dask-ML provides the LabelEncoder class, which encodes labels with
values between 0 and n_classes - 1.
# Labelencoder
from dask_ml.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['type'] = encoder.fit_transform(df['type'])
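To make the effect of the encoding concrete, here is a small sketch that reproduces the same kind of mapping with NumPy instead of dask_ml. The toy labels are invented for illustration. Classes are numbered in sorted order, so nonspam becomes 0 and spam becomes 1.

```python
import numpy as np

# Toy labels (illustrative, not taken from spam.csv)
labels = np.array(["nonspam", "spam", "spam", "nonspam", "spam"])

# np.unique returns the sorted classes and, for each label, its class index
classes, codes = np.unique(labels, return_inverse=True)

print(classes)  # sorted class names
print(codes)    # integer code of each label
```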
1.3.1 Data splitting
Like scikit-learn, Dask-ML provides train_test_split for splitting arrays (or data frames) into random
train and test subsets. The main difference between the two is that the Dask-ML version doesn't
have the stratify parameter of sklearn. We keep 0.7 (70%) of the individuals for training.
# Data splitting
from dask_ml.model_selection import train_test_split
DTrain, DTest = train_test_split(df, train_size = .7, random_state=2021)
Shape :
- Train : (3220, 58)
- Test : (1381, 58)
# In train set
print(DTrain['type'].value_counts(normalize=True))
0 0.612112
1 0.387888
Name: type, dtype: float64
# In test set
print(DTest['type'].value_counts(normalize=True))
0 0.5916
1 0.4084
Name: type, dtype: float64
There is only a slight difference in the class distribution between the training
sample and the test sample (38.8% versus 40.8% spam).
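If closer class proportions are wanted despite the missing stratify parameter, one option is to split each class separately. The function below is a hypothetical helper written with NumPy, not part of dask_ml, and it is shown on toy labels that only reuse the class counts of the spam data.

```python
import numpy as np

def stratified_split(y, train_size=0.7, seed=2021):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_parts, test_parts = [], []
    for cls in np.unique(y):
        # Shuffle the positions of this class, then cut at train_size
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_train = int(round(train_size * len(idx)))
        train_parts.append(idx[:n_train])
        test_parts.append(idx[n_train:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Toy labels with the same class counts as the spam data (2788 / 1813)
y = np.array([0] * 2788 + [1] * 1813)
train_idx, test_idx = stratified_split(y)

# Share of spam in each subset
print(y[train_idx].mean(), y[test_idx].mean())
```

The returned index arrays could then be used to select rows from the data frame, for example with df.iloc.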
2 Logistic regression
Regression methods have become an integral component of any data analysis concerned with
describing the relationship between a response variable and one or more explanatory variables.
It is often the case that the outcome variable is discrete, taking on two or more possible values.
This is a classification problem.
2.1 Classification
Classification tasks are a sub-field of supervised machine learning, where the key
objective is to predict output labels or responses that are categorical in nature for input data,
based on what the model has learned in the training phase. Output labels are also known as
classes or class labels. They are categorical in nature, meaning they are unordered and discrete
values. Thus, each output response belongs to a specific discrete class or category.
Over the last decade the logistic regression model has become, in many fields, the standard
method of analysis in this situation.
There is a good reason for this popularity. Unlike traditional linear or normal regression,
logistic regression is appropriate for modeling a binary variable. A binary variable has only two
values, 1 and 0. These values may be thought of as success and failure, or as any other type
of positive and non-positive dichotomy.
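For reference, the standard form of the binary logistic regression model can be written as follows. The symbols are notation introduced here for illustration: p is the probability that the response equals 1, x_1, ..., x_k are the predictors, and \beta_0, ..., \beta_k are the intercept and coefficients to be estimated.

```latex
\[
p = P(Y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \sum_{j=1}^{k} \beta_j x_j)}}
\qquad\Longleftrightarrow\qquad
\log\frac{p}{1 - p} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j
\]
```

The right-hand form shows that the model is linear in the log-odds, which is why each coefficient is read as a change in log-odds per unit of its predictor.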
In this paper, we will show how to perform logistic regression using Dask packages combined
with sklearn. As we have seen in the introduction, our data has a binary response variable, type,
and 57 predictor variables.
LogisticRegression(solver='lbfgs')
• .intercept_ : a float. The learned value for the intercept, if one was added to the model.
By default, fit_intercept is set to True.
# coefficients
coef = pd.DataFrame({'coef.': model.coef_})
coef.index = XTrain.columns
print(coef)
coef.
make -0.479880
address -0.150682
all 0.135193
num3d 0.541169
our 0.659910
over 0.596929
remove 2.251704
internet 0.625953
order 0.336239
mail 0.094922
receive -0.217522
will -0.200684
people -0.003069
report 0.071286
addresses 2.528279
free 1.056557
business 0.933110
email 0.089965
you 0.064004
credit 0.611418
your 0.179545
font 0.183953
num000 2.176546
money 0.815767
hp -1.510591
hpl -1.755439
george -1.098980
num650 0.363375
lab -1.765462
labs -0.189831
telnet -1.436668
num857 -1.367995
data -0.529982
num415 0.042820
num85 -1.568667
technology 0.650104
num1999 0.042632
parts 0.521060
pm -0.930480
direct -0.649896
cs -3.732531
meeting -1.867262
original -1.280181
project -1.049496
re -0.791601
edu -1.063858
table -2.094519
conference -2.914653
charSemicolon -1.203818
charRoundbracket -0.152945
charSquarebracket -0.512236
charExclamation 0.270677
charDollar 5.325154
charHash 1.786209
capitalAve 0.030869
capitalLong 0.008978
capitalTotal 0.000340
# intercept
print(f"""Intercept :
{model.intercept_}""")
Intercept :
-1.5461005587075771
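To help read these numbers, the sketch below converts the intercept into a probability and two of the printed coefficients into odds ratios, using the standard logistic function. It assumes the usual interpretation of a logistic model and copies the values from the output above; note that the size of an odds ratio depends on the scale of its predictor.

```python
import math

def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

# Values copied from the output above
intercept = -1.5461005587075771
coef_char_dollar = 5.325154
coef_hp = -1.510591

# Predicted probability of spam when every predictor equals zero
p_all_zero = sigmoid(intercept)

# Multiplicative change in the odds of spam per unit increase of a predictor
odds_ratio_char_dollar = math.exp(coef_char_dollar)
odds_ratio_hp = math.exp(coef_hp)

print(round(p_all_zero, 4), odds_ratio_char_dollar, odds_ratio_hp)
```

A positive coefficient (charDollar) gives an odds ratio above 1 and pushes towards spam; a negative one (hp) gives an odds ratio below 1 and pushes towards nonspam.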
[False False True False False True True False False True False True
True False False False True True True True False False True False
True False False False True False True False False False False False
False False True True True True False False True False False False
True False False False True False False False False False False True]
# dask array
import dask.array as da
yPred = da.where(ypred==True,1,0)
print(yPred.compute()[:60])
[0 0 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 1 1 1 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0
0 1 1 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1]
# distribution of the predicted classes (counts)
print(pd.DataFrame({"prediction": yPred.compute()}).value_counts())
prediction
0 855
1 526
dtype: int64
There are 526 positive predictions i.e. 526 emails from the test sample are designated spam.
# Confusion matrix
from sklearn.metrics import confusion_matrix
conf_mat = pd.DataFrame(confusion_matrix(yTest, yPred),
index = ["nonspam", "spam"],
columns = ["nonspam", "spam"])
print(conf_mat)
nonspam spam
nonspam 781 36
spam 74 490
Figure 1: Confusion matrix
The results in the matrix above can be read as follows, given that
1 represents a spam e-mail.
• True Positive (TP): the number of actual positive cases that are correctly predicted
as positive. Out of 564 actual spam e-mails, 490 are correctly predicted as spam.
Thus, the value of True Positive is 490.
• False Positive (FP): the number of incorrect positive predictions, that is, the number
of nonspam e-mails falsely predicted as spam. Out of 817 actual nonspam e-mails, 36 are
falsely predicted as spam. Thus, the value of False Positive is 36.
• True Negative (TN): the number of actual nonspam cases that are correctly predicted
as nonspam. Out of 817 actual nonspam e-mails, 781 are correctly predicted as nonspam.
Thus, the value of True Negative is 781.
• False Negative (FN): the number of incorrect nonspam predictions, that is, the number
of spam e-mails falsely predicted as nonspam. Out of 564 actual spam e-mails, 74 are
falsely predicted as nonspam. Thus, the value of False Negative is 74.
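The four counts above can be pulled out of the confusion matrix programmatically. This sketch rebuilds the matrix from the printed values with NumPy and relies on the row and column order used above (actual classes in rows, predicted classes in columns, nonspam first).

```python
import numpy as np

# Confusion matrix printed above: rows = actual, columns = predicted
conf = np.array([[781, 36],
                 [74, 490]])

# Row-major flattening gives TN, FP, FN, TP for a 0/1 problem
tn, fp, fn, tp = conf.ravel()

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print("actual spam:", tp + fn, "actual nonspam:", tn + fp)
```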
In terms of classification metrics, Dask-ML provides only two: accuracy_score and log_loss.
out of the total number of times it made predictions. For example, let's assume that you were
testing your machine learning model with a dataset of 100 records and that it
predicted 90 of those instances correctly. The accuracy, in this case, would be
90/100 = 90%. This accuracy rate is high, but it doesn't tell us anything about the kinds of errors
the model makes on new data it hasn't seen before.
Mathematically, it represents the ratio of the sum of true positives and true negatives to
all the predictions:
\[
\text{Accuracy Score} = \frac{TP + TN}{TP + FN + TN + FP}
\]
The same score can be obtained with the accuracy_score function from dask_ml.metrics.
# accuracy
from dask_ml.metrics import accuracy_score
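As a check by hand, the accuracy can be computed from the four confusion-matrix counts reported earlier. The sketch below uses plain Python and hard-codes those counts.

```python
# Counts taken from the confusion matrix reported earlier
tp, fp, tn, fn = 490, 36, 781, 74

n_total = tp + fn + tn + fp

# Share of correct predictions among all predictions
accuracy = (tp + tn) / n_total
error_rate = (fp + fn) / n_total

print(round(accuracy, 4), round(error_rate, 4))
```

Applied to the same test labels and predictions, accuracy_score should agree with this value.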
While training a classification model, we want each observation to be predicted with a
probability as close as possible to its actual value (0 or 1). Log loss is therefore
a good choice of loss function for training and optimizing classification models: the
farther the predicted probability is from the true value, the more heavily the prediction is penalized.
# log loss
from dask_ml.metrics import log_loss
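To illustrate how this penalty behaves, here is a sketch of the usual binary log-loss formula written with NumPy. The helper name, the clipping constant eps and the toy probabilities are assumptions made for the example; it is not the dask_ml implementation.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels and predicted P(y = 1)."""
    y = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for probabilities of exactly 0 or 1
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_true = [1, 0, 1, 0]

# Confident and correct probabilities give a small loss
good = binary_log_loss(y_true, [0.9, 0.1, 0.8, 0.2])

# Confident but wrong probabilities are penalized heavily
bad = binary_log_loss(y_true, [0.1, 0.9, 0.2, 0.8])

print(good, bad)
```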
2.5.5 F1-Score
The F1 score summarizes model performance as a function of the precision and recall scores. It
gives equal weight to precision and recall, which makes it an alternative to accuracy
(it doesn't require us to know the total number of observations). It is often used as a
single value that provides high-level information about the quality of the model's output. It is a
useful measure in scenarios where one tries to optimize either precision or recall alone
and, as a result, the overall model performance suffers. The following points illustrate the
issues with optimizing only one of the two:
• Optimizing for recall helps minimize the chance of not detecting a malignant cancer.
However, this comes at the cost of predicting malignant cancer in patients who are
actually healthy (a high number of FP).
• Optimizing for precision helps ensure that a patient predicted to have a malignant cancer
really has one. However, this comes at the cost of missing malignant cancers more
frequently (a high number of FN).
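Returning to the spam example, precision, recall and the F1 score can be computed from the confusion-matrix counts reported earlier, using their standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as the harmonic mean of the two. The sketch below is plain Python with those counts hard-coded.

```python
# Counts taken from the confusion matrix reported earlier (spam = positive class)
tp, fp, fn = 490, 36, 74

# Share of predicted spam that really is spam
precision = tp / (tp + fp)

# Share of actual spam that the model detects
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Note that the true negatives do not appear anywhere in these three formulas.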
2.5.6 Classification report
A classification report summarizes the performance of a trained classification model. It displays
the precision, recall, F1 score and support for each class, and so gives a better understanding
of the overall performance of our trained model.
# classification report
from sklearn.metrics import classification_report
print(classification_report(yTest, yPred,
target_names = ["nonspam", "spam"]))
When using ROC analysis, the analyst should look at both the ROC statistic and the
plot of sensitivity versus one minus specificity. A model with no predictive power has an
ROC curve with a slope of 1 (the diagonal), which corresponds to an ROC statistic of 0.5.
• Values from 0.65 to 0.80 have moderate predictive value. Many logistic models fit into this
range.
• Values greater than 0.8 and less than 0.9 are generally regarded as having strong predictive
power.
• Values of 0.9 and greater indicate the highest amount of predictive power, but models rarely
achieve values in this range.
The model is a better classifier with greater values of the ROC statistic, or area under the
curve (AUC). Beware of over-fitting with such models. Validating the model with a validation
sample or samples is recommended.
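One way to build intuition for the ROC statistic is its rank interpretation: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The helper below is a hypothetical NumPy sketch of that definition on toy data, not the sklearn implementation.

```python
import numpy as np

def auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y == 1], s[y == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Toy labels and scores (illustrative)
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = auc_pairwise(y_toy, s_toy)

# Identical scores for everyone carry no information
auc_constant = auc_pairwise(y_toy, [0.5, 0.5, 0.5, 0.5])

print(auc_toy, auc_constant)
```

The pairwise comparison is quadratic in the sample size, so it is meant for illustration on small samples only.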
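# ROC Curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# predicted class probabilities; column 1 is the probability of spam
score = model.predict_proba(XTest.values)
fpr, tpr, thresholds = roc_curve(yTest, score[:,1])
# AUC computed from the predicted probabilities rather than the hard 0/1 labels
auc = roc_auc_score(yTest, score[:,1])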
import dask.dataframe as dd
dataframe = pd.DataFrame({'score':score[:,1]})
dask_df = dd.from_pandas(dataframe, npartitions=10)
print(dask_df.compute())
score
0 6.323558e-11
1 6.555002e-03
2 9.881983e-01
3 6.976487e-08
4 3.093341e-02
... ...
1376 2.407358e-01
1377 1.725983e-03
1378 4.323683e-01
1379 9.085397e-01
1380 6.323558e-11
# quantile
print(dask_df.quantile(q=[0, .25, .5, .75, 1.0]).compute().to_frame().T)
We use the scikitplot package to build the lift curve, through its plot_cumulative_gain()
function, which draws the cumulative gains chart.
# lift curve
import matplotlib.pyplot as plt
import scikitplot as skplt
skplt.metrics.plot_cumulative_gain(y_true=yTest, y_probas=score)
plt.show()
The curve is close to the theoretical limit (reached when every e-mail with type = 1 is assigned
a higher score than every e-mail with type = 0). Our targeting is of excellent quality.
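The quantity plotted on such a chart can be sketched by hand: sort the observations by decreasing score and track the share of all positives captured so far. The helper below is a hypothetical NumPy illustration on toy data and is not how scikitplot is implemented.

```python
import numpy as np

def cumulative_gain(y_true, scores):
    """Return (share of sample targeted, share of positives captured)."""
    y = np.asarray(y_true)
    # Highest scores first; stable sort keeps the original order among ties
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    gains = np.cumsum(y[order]) / y.sum()
    targeted = np.arange(1, len(y) + 1) / len(y)
    return targeted, gains

# Toy labels and scores (illustrative)
y_toy = [1, 0, 1, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.2, 0.1]
targeted, gains = cumulative_gain(y_toy, s_toy)

for t, g in zip(targeted, gains):
    print(f"top {t:.0%} of e-mails -> {g:.0%} of spam captured")
```

In the ideal case all positives come first, so the curve reaches 100% as soon as the targeted share equals the share of positives in the sample.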
3 Bibliography
1. Andreas C. Müller & Sarah Guido, Introduction to Machine Learning with Python: A
Guide for Data Scientists.
4. David G. Kleinbaum & Mitchel Klein, Logistic Regression: A Self-Learning Text, Third
Edition.
5. David W. Hosmer & Stanley Lemeshow, Applied Logistic Regression, Second Edition.
6. Dipanjan Sarkar, Raghav Bali & Tushar Sharma, Practical Machine Learning with Python:
A Problem-Solver's Guide to Building Real-World Intelligent Systems.