
Published in Towards Data Science

Susan Li

May 6, 2018 · 6 min read

Machine Learning with PySpark and MLlib — Solving a Binary Classification Problem

Photo Credit: Pixabay

Apache Spark, once a component of the Hadoop ecosystem, is now becoming the big-data platform of choice for enterprises. It is a powerful open-source engine that provides real-time stream processing, interactive processing, graph processing, in-memory processing and batch processing, with very high speed, ease of use and a standard interface.

In industry, there is strong demand for a powerful engine that can do all of the above. Sooner or later, your company or your clients will be using Spark to develop sophisticated models that let you discover new opportunities or avoid risk. Spark is not hard to learn: if you already know Python and SQL, it is very easy to get started. Let's give it a try today!

Exploring The Data


We will use the same data set as when we built a logistic regression in Python: it relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit. The dataset can be downloaded from Kaggle.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ml-bank').getOrCreate()
df = spark.read.csv('bank.csv', header=True, inferSchema=True)
df.printSchema()
Figure 1

Input variables: age, job, marital, education, default, balance, housing, loan,
contact, day, month, duration, campaign, pdays, previous, poutcome.

Output variable: deposit

Have a peek at the first five observations. A pandas DataFrame renders more readably than Spark's DataFrame.show().

import pandas as pd
pd.DataFrame(df.take(5), columns=df.columns).transpose()

Figure 2

Our classes are fairly well balanced.

df.groupBy('deposit').count().show()

Figure 3
Summary statistics for numeric variables

numeric_features = [t[0] for t in df.dtypes if t[1] == 'int']
df.select(numeric_features).describe().toPandas().transpose()

Figure 4

Correlations between independent variables.

numeric_data = df.select(numeric_features).toPandas()

# pd.scatter_matrix was removed in newer pandas; use pd.plotting.scatter_matrix.
axs = pd.plotting.scatter_matrix(numeric_data, figsize=(8, 8))
n = len(numeric_data.columns)
for i in range(n):
    v = axs[i, 0]
    v.yaxis.label.set_rotation(0)
    v.yaxis.label.set_ha('right')
    v.set_yticks(())
    h = axs[n - 1, i]
    h.xaxis.label.set_rotation(90)
    h.set_xticks(())

Figure 5
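If you prefer numbers to a plot, the same check can be done directly in Spark. Here is a minimal sketch using DataFrame.stat.corr, which computes the Pearson correlation between two columns:

# Print the pairwise Pearson correlation for every pair of numeric columns.
for i, c1 in enumerate(numeric_features):
    for c2 in numeric_features[i + 1:]:
        print(c1, c2, round(df.stat.corr(c1, c2), 3))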

It is obvious that there are no highly correlated numeric variables. Therefore, we will keep all of them for the model. However, the day and month columns are not really useful, so we will remove these two columns.

df = df.select('age', 'job', 'marital', 'education', 'default',
               'balance', 'housing', 'loan', 'contact', 'duration',
               'campaign', 'pdays', 'previous', 'poutcome', 'deposit')
cols = df.columns
df.printSchema()

Figure 6

Preparing Data for Machine Learning


The process includes Category Indexing, One-Hot Encoding and
VectorAssembler — a feature transformer that merges multiple columns into
a vector column.

from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
# Note: in Spark 3.x, OneHotEncoderEstimator has been renamed to OneHotEncoder.

categoricalColumns = ['job', 'marital', 'education', 'default',
                      'housing', 'loan', 'contact', 'poutcome']
stages = []
for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + 'Index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]

label_stringIdx = StringIndexer(inputCol='deposit', outputCol='label')
stages += [label_stringIdx]

numericCols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

The code above is adapted from Databricks' official site. It indexes each categorical column using the StringIndexer, then converts the indexed categories into one-hot encoded variables; the resulting binary vectors are appended to the end of each row. We use the StringIndexer again to encode our labels as label indices. Finally, we use the VectorAssembler to combine all the feature columns into a single vector column.
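To see what these transformers actually produce, here is a small self-contained sketch on a toy column (the toy data and column names are made up for illustration; in Spark 3.x the encoder class is simply OneHotEncoder):

from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer

toy = spark.createDataFrame([('married',), ('single',), ('divorced',), ('married',)],
                            ['marital'])
indexed = StringIndexer(inputCol='marital', outputCol='maritalIndex') \
    .fit(toy).transform(toy)
encoded = OneHotEncoderEstimator(inputCols=['maritalIndex'],
                                 outputCols=['maritalVec']) \
    .fit(indexed).transform(indexed)
# The most frequent value ('married') gets index 0.0 and becomes a sparse
# vector such as (2,[0],[1.0]); the last category is dropped by default.
encoded.show(truncate=False)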

Pipeline

We use a Pipeline to chain multiple Transformers and Estimators together and specify our machine learning workflow. A Pipeline's stages are specified as an ordered array.

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
selectedCols = ['label', 'features'] + cols
df = df.select(selectedCols)
df.printSchema()

Figure 7

pd.DataFrame(df.take(5), columns=df.columns).transpose()

Figure 8

As you can see, we now have a features column and a label column.

Randomly split data into train and test sets, and set seed for reproducibility.

train, test = df.randomSplit([0.7, 0.3], seed = 2018)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

Training Dataset Count: 7764
Test Dataset Count: 3398

Logistic Regression Model

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10)
lrModel = lr.fit(train)

We can obtain the coefficients by using the LogisticRegressionModel's attributes.

import matplotlib.pyplot as plt
import numpy as np

beta = np.sort(lrModel.coefficients)
plt.plot(beta)
plt.ylabel('Beta Coefficients')
plt.show()

Figure 9
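The sorted plot loses track of which coefficient belongs to which input. If you want to match coefficients back to feature names, one option is the ml_attr metadata that VectorAssembler attaches to the features column (a sketch, not part of the original post):

from itertools import chain

# Pair each expanded feature name with its coefficient using the column metadata.
attrs = train.schema['features'].metadata['ml_attr']['attrs']
name_by_idx = {a['idx']: a['name'] for a in chain(*attrs.values())}
for idx, coef in enumerate(lrModel.coefficients.toArray()):
    print(name_by_idx.get(idx, idx), coef)
print('Intercept:', lrModel.intercept)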

We can summarize the model over the training set and obtain the receiver operating characteristic curve and areaUnderROC.

trainingSummary = lrModel.summary

roc = trainingSummary.roc.toPandas()
plt.plot(roc['FPR'], roc['TPR'])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

print('Training set areaUnderROC: ' + str(trainingSummary.areaUnderROC))

Figure 10

Precision and recall.

pr = trainingSummary.pr.toPandas()
plt.plot(pr['recall'],pr['precision'])
plt.ylabel('Precision')
plt.xlabel('Recall')
plt.show()
Figure 11
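The training summary also reports metrics by threshold. A small sketch (assuming the summary's column names 'threshold' and 'F-Measure') to find the probability cutoff that maximizes F1 on the training set:

# Pick the decision threshold with the highest F-measure on the training summary.
f = trainingSummary.fMeasureByThreshold.toPandas()
best = f.loc[f['F-Measure'].idxmax()]
print('Best threshold: %.3f, F1: %.3f' % (best['threshold'], best['F-Measure']))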

Make predictions on the test set.

predictions = lrModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction',
'prediction', 'probability').show(10)

Figure 12

Evaluate our Logistic Regression model.

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
print('Test Area Under ROC', evaluator.evaluate(predictions))

Test Area Under ROC 0.8858324614449619
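BinaryClassificationEvaluator uses areaUnderROC by default; it also supports areaUnderPR, which can be more informative when classes are less balanced. For example:

# Area under the precision-recall curve for the same predictions.
print('Test Area Under PR',
      evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderPR'}))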

Decision Tree Classifier


Decision trees are widely used because they are easy to interpret, handle categorical features, extend naturally to multi-class classification, do not require feature scaling, and are able to capture non-linearities and feature interactions.

from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(featuresCol='features', labelCol='label', maxDepth=3)
dtModel = dt.fit(train)
predictions = dtModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction',
                   'prediction', 'probability').show(10)
Figure 13

Evaluate our Decision Tree model.

evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions,
{evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.7807240050065357
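With maxDepth = 3 the fitted tree is small enough to read. A quick sketch to inspect its structure through the model's attributes:

# Print the tree's size and its full if/else split rules.
print('Depth:', dtModel.depth, 'Nodes:', dtModel.numNodes)
print(dtModel.toDebugString)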

A single decision tree performed poorly because it is too weak a model given the range of different features. The predictive accuracy of decision trees can be improved by ensemble methods such as Random Forests and Gradient-Boosted Trees.

Random Forest Classifier

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol='features', labelCol='label')
rfModel = rf.fit(train)
predictions = rfModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction',
                   'prediction', 'probability').show(10)

Figure 14

Evaluate our Random Forest Classifier.

evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions,
{evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.8846453518867426
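The fitted forest also reports how much each assembled feature contributed to its splits. A quick sketch listing the ten most important feature indices (they line up with the features vector built by the VectorAssembler):

# featureImportances is a vector aligned with the assembled 'features' column.
importances = rfModel.featureImportances.toArray()
top10 = sorted(enumerate(importances), key=lambda x: -x[1])[:10]
print(top10)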


Gradient-Boosted Tree Classifier

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(maxIter=10)
gbtModel = gbt.fit(train)
predictions = gbtModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction',
'prediction', 'probability').show(10)

Figure 15

Evaluate our Gradient-Boosted Tree Classifier.

evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions,
{evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.8940728473145346

The Gradient-Boosted Tree classifier achieved the best results, so we will try tuning this model with ParamGridBuilder and CrossValidator. Before that, we can use explainParams() to print a list of all params and their definitions, to understand which params are available for tuning.

print(gbt.explainParams())

Figure 16

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 4, 6])
             .addGrid(gbt.maxBins, [20, 60])
             .addGrid(gbt.maxIter, [10, 20])
             .build())
cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid,
                    evaluator=evaluator, numFolds=5)

# Run cross validations. This can take about 6 minutes since it is
# training over 20 trees!
cvModel = cv.fit(train)
predictions = cvModel.transform(test)
evaluator.evaluate(predictions)

0.8981050997838095
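CrossValidator keeps the best model it found. A short sketch to see the winning parameter combination and the average cross-validated metric for each grid point (extractParamMap and avgMetrics come from the tuning API):

# Inspect the best GBT model and the average metric per parameter combination.
print(cvModel.bestModel.extractParamMap())
print('Average CV metrics:', cvModel.avgMetrics)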

To sum up, we have learned how to build a binary classification application using PySpark and the MLlib Pipelines API. We tried four algorithms, and gradient boosting performed best on our data set.

Source code can be found on Github. I look forward to hearing feedback or questions.

Reference: Apache Spark 2.1.0

Machine Learning Apache Spark Python Predictive Analytics Pyspark

