Machine Learning with PySpark and MLlib — Solving a Binary Classification Problem
Susan Li
In the industry, there is a big demand for a powerful engine that can do all of
the above. Sooner or later, your company or your clients will be using Spark to
develop sophisticated models that enable you to discover new opportunities or
avoid risk. Spark is not hard to learn; if you already know Python and SQL, it
is very easy to get started. Let's give it a try today!
Input variables: age, job, marital, education, default, balance, housing, loan,
contact, day, month, duration, campaign, pdays, previous, poutcome.
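First, we load the data into a Spark DataFrame. Below is a minimal sketch; the file name bank.csv and the app name are assumptions, not taken from the original.
from pyspark.sql import SparkSession
# Start (or reuse) a local Spark session; the app name is an arbitrary choice
spark = SparkSession.builder.appName('ml-bank').getOrCreate()
# Read the CSV ('bank.csv' is an assumed path) and let Spark infer column types
df = spark.read.csv('bank.csv', header=True, inferSchema=True)
df.printSchema()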
Let's have a peek at the first five observations. A pandas DataFrame is
prettier to look at than the output of Spark's DataFrame.show().
import pandas as pd
pd.DataFrame(df.take(5), columns=df.columns).transpose()
Figure 2
Summary statistics for numeric variables
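A sketch of computing them, assuming the numeric features are the integer-typed columns in the schema:
# df.dtypes is a list of (column name, type string) pairs
numeric_features = [name for name, dtype in df.dtypes if dtype == 'int']
df.select(numeric_features).describe().toPandas().transpose()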
Figure 4
numeric_data = df.select(numeric_features).toPandas()
# pd.scatter_matrix was removed from pandas; use pd.plotting.scatter_matrix
axs = pd.plotting.scatter_matrix(numeric_data, figsize=(8, 8))
# Tidy up the axis labels and ticks on the matrix of subplots
n = len(numeric_data.columns)
for i in range(n):
    v = axs[i, 0]
    v.yaxis.label.set_rotation(0)
    v.yaxis.label.set_ha('right')
    v.set_yticks(())
    h = axs[n - 1, i]
    h.xaxis.label.set_rotation(90)
    h.set_xticks(())
Figure 5
The scatter matrix shows no highly correlated numeric variables, so we will
keep all of them for the model. The day and month columns, however, are not
really useful, so we will remove them, as sketched below.
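A minimal sketch of the column removal, keeping the remaining column names around for later use:
# Drop the two calendar columns and remember the remaining column names
df = df.drop('day', 'month')
cols = df.columns
df.printSchema()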
Figure 6
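The preprocessing below is a minimal sketch along the lines of Databricks' example; the categorical column names and the label column name ('deposit') are assumptions based on the schema above, and on Spark 2.x the encoder class is called OneHotEncoderEstimator rather than OneHotEncoder.
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
# Categorical columns are an assumption based on the schema above
categoricalColumns = ['job', 'marital', 'education', 'default',
                      'housing', 'loan', 'contact', 'poutcome']
stages = []
for categoricalCol in categoricalColumns:
    # Index each string column, then one-hot encode the resulting indices
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + 'Index')
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()],
                            outputCols=[categoricalCol + 'classVec'])
    stages += [stringIndexer, encoder]
# Encode the label column ('deposit' is an assumption) to label indices
label_stringIdx = StringIndexer(inputCol='deposit', outputCol='label')
stages += [label_stringIdx]
# Assemble one-hot vectors and numeric columns into a single feature vector
numericCols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
assemblerInputs = [c + 'classVec' for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol='features')
stages += [assembler]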
This preprocessing code is adapted from Databricks' official site. It indexes
each categorical column using the StringIndexer, then converts the indexed
categories into one-hot encoded variables. The resulting output has the
binary vectors appended to the end of each row. We use the StringIndexer
again to encode our labels to label indices. Next, we use the VectorAssembler
to combine all the feature columns into a single vector column.
Pipeline
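We chain all the stages in a Pipeline and run it over the DataFrame; a minimal sketch, keeping the new label and features columns alongside the original ones:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
# Keep label and features plus the original columns for inspection
df = df.select(['label', 'features'] + cols)
df.printSchema()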
Figure 7
pd.DataFrame(df.take(5), columns=df.columns).transpose()
Figure 8
As you can see, we now have a features column and a label column.
Randomly split data into train and test sets, and set seed for reproducibility.
train, test = df.randomSplit([0.7, 0.3], seed = 2018)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))
Figure 9
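With the data prepared, we fit a logistic regression model on the training set; a minimal sketch, where maxIter=10 is an assumption:
from pyspark.ml.classification import LogisticRegression
# Fit a logistic regression on the assembled feature vector
lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10)
lrModel = lr.fit(train)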
We can summarize the model over the training set and also obtain the receiver
operating characteristic and the areaUnderROC.
import matplotlib.pyplot as plt
trainingSummary = lrModel.summary
roc = trainingSummary.roc.toPandas()
# FPR goes on the x-axis and TPR on the y-axis (the labels were swapped)
plt.plot(roc['FPR'], roc['TPR'])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve')
plt.show()
Figure 10
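The areaUnderROC mentioned above is available directly on the training summary:
print('Training set areaUnderROC: ' + str(trainingSummary.areaUnderROC))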
pr = trainingSummary.pr.toPandas()
plt.plot(pr['recall'],pr['precision'])
plt.ylabel('Precision')
plt.xlabel('Recall')
plt.show()
Figure 11
predictions = lrModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction',
'prediction', 'probability').show(10)
Figure 12
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
print('Test Area Under ROC', evaluator.evaluate(predictions))
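For comparison, we fit a single decision tree on the same splits before evaluating it below; a minimal sketch, where maxDepth=3 is an assumption:
from pyspark.ml.classification import DecisionTreeClassifier
# Train a single decision tree and score the test set
dt = DecisionTreeClassifier(featuresCol='features', labelCol='label', maxDepth=3)
dtModel = dt.fit(train)
predictions = dtModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction',
                   'prediction', 'probability').show(10)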
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions,
{evaluator.metricName: "areaUnderROC"})))
A single decision tree performed poorly because it is too weak given the
range of different features. The prediction accuracy of decision trees can be
improved by ensemble methods, such as Random Forest and Gradient-Boosted
Trees, as sketched below.
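Starting with the random forest; a minimal sketch using default hyperparameters (an assumption):
from pyspark.ml.classification import RandomForestClassifier
# Train a random forest with default hyperparameters and score the test set
rf = RandomForestClassifier(featuresCol='features', labelCol='label')
rfModel = rf.fit(train)
predictions = rfModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction',
                   'prediction', 'probability').show(10)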
Figure 14
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions,
{evaluator.metricName: "areaUnderROC"})))
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(maxIter=10)
gbtModel = gbt.fit(train)
predictions = gbtModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction',
'prediction', 'probability').show(10)
Figure 15
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions,
{evaluator.metricName: "areaUnderROC"})))
The Gradient-Boosted Tree achieved the best results, so we will try tuning
this model with the ParamGridBuilder and the CrossValidator. Before that, we
can use explainParams() to print a list of all params and their definitions,
to understand which params are available for tuning.
print(gbt.explainParams())
Figure 16
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 4, 6])
             .addGrid(gbt.maxBins, [20, 60])
             .addGrid(gbt.maxIter, [10, 20])
             .build())
cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid,
                    evaluator=evaluator, numFolds=5)
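Fitting the cross-validator trains one model per parameter combination per fold, so it can take a while; scoring the best model on the test set gives the area under ROC shown below:
# Fit across the parameter grid, then score the best model on the test set
cvModel = cv.fit(train)
predictions = cvModel.transform(test)
evaluator.evaluate(predictions)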
0.8981050997838095