Distributed Machine Learning With PySpark
and Scikit-Learn
Abdelaziz Testas
https://doi.org/10.1007/978-1-4842-9751-3
This work is subject to copyright. All rights are reserved by the Publisher,
whether the whole or part of the material is concerned, specifically the
rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or
hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather
than use a trademark symbol with every occurrence of a trademarked name,
logo, or image we use the names, logos, and images only in an editorial
fashion and to the benefit of the trademark owner, with no intention of
infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and
similar terms, even if they are not identified as such, is not to be taken as an
expression of opinion as to whether or not they are subject to proprietary
rights.
While the advice and information in this book are believed to be true and
accurate at the date of publication, neither the authors nor the editors nor
the publisher can accept any legal responsibility for any errors or omissions
that may be made. The publisher makes no warranty, express or implied,
with respect to the material contained herein.
Table of Contents
Similarity in Syntax ......................................................... 4
Loading Data ................................................................. 5
Selecting Columns ............................................................ 6
Aggregating Data ............................................................. 9
Filtering Data .............................................................. 11
Joining Data ................................................................ 14
Saving Data ................................................................. 17
Modeling Steps .............................................................. 18
Pipelines ................................................................... 20
Summary ..................................................................... 23
The Dataset ................................................................. 26
Selecting Algorithms with Cross-Validation .................................. 34
Scikit-Learn ................................................................ 36
PySpark ..................................................................... 42
Scikit-Learn ................................................................ 47
PySpark ..................................................................... 49
Summary ..................................................................... 51
Chapter 3: Multiple Linear Regression with Pandas, Scikit-Learn, and PySpark ... 53
The Dataset ................................................................. 54
Summary ..................................................................... 73
Chapter 4: Decision Tree Regression with Pandas, Scikit-Learn, and PySpark ..... 75
The Dataset ................................................................. 76
Scikit-Learn ............................................................... 109
PySpark .................................................................... 111
Summary .................................................................... 113
and PySpark ................................................................ 115
The Dataset ................................................................ 116
Scikit-Learn ............................................................... 137
PySpark .................................................................... 139
Summary .................................................................... 142
and PySpark ................................................................ 143
The Dataset ................................................................ 144
Scikit-Learn ............................................................... 167
PySpark .................................................................... 169
Summary .................................................................... 171
The Dataset ................................................................ 174
Logistic Regression ........................................................ 180
Scikit-Learn ............................................................... 204
PySpark .................................................................... 208
Summary .................................................................... 212
and PySpark ................................................................ 213
The Dataset ................................................................ 214
Scikit-Learn ............................................................... 235
PySpark .................................................................... 238
Summary .................................................................... 241
Scikit-Learn ............................................................... 253
PySpark .................................................................... 255
Summary .................................................................... 258
The Dataset ................................................................ 260
PySpark .................................................................... 278
Summary .................................................................... 280
and PySpark ................................................................ 281
The Dataset ................................................................ 282
Scikit-Learn ............................................................... 294
PySpark .................................................................... 295
Summary .................................................................... 297
The Dataset ................................................................ 300
MLP Classification ......................................................... 307
Scikit-Learn ............................................................... 324
PySpark .................................................................... 325
Summary .................................................................... 327
The Dataset ................................................................ 331
Surprise ................................................................... 351
PySpark .................................................................... 352
Summary .................................................................... 353
and PySpark ................................................................ 355
The Dataset ................................................................ 356
Scikit-Learn ............................................................... 389
PySpark .................................................................... 391
Summary .................................................................... 394
The Dataset ................................................................ 396
Scikit-Learn ............................................................... 412
PySpark .................................................................... 414
Summary .................................................................... 416
Examples of Hyperparameters ................................................ 417
Scikit-Learn ............................................................... 436
PySpark .................................................................... 438
Summary .................................................................... 440
Scikit-Learn ............................................................... 458
PySpark .................................................................... 459
Summary .................................................................... 461
PySpark .................................................................... 470
Scikit-Learn ............................................................... 479
PySpark .................................................................... 480
Summary .................................................................... 482
Index ...................................................................... 483
About the Author
The author worked on data science projects and methodology work, and drove advanced solutions into Nielsen's digital ad and content rating products by leveraging subject matter expertise in media measurement and data science. The author is passionate about helping others improve their machine learning skills and workflows and is excited to share his knowledge and experience with a wider audience through this book.
Introduction
In this context, this book will explore the benefits of using PySpark over
traditional single-node computing tools and provide guidance for data
scientists who are considering transitioning to PySpark.
In this book, we aim to provide a comprehensive overview of the main
machine learning algorithms with a particular focus on regression and
classification. These are fundamental techniques that form the backbone of
many practical applications of machine learning. We will cover popular
methods such as linear and logistic regression, decision trees, random
forests, gradient-boosted trees, support vector machines, Naive Bayes, and
neural networks. We will also discuss how these algorithms can be applied
to real-world problems, such as predicting house prices, and the likelihood
of diabetes as well as classifying handwritten digits or the species of an Iris
flower and predicting whether a tumor is benign or malignant. Whether you
are a beginner or an experienced practitioner, this book is designed to help
you understand the core concepts of machine learning and develop the skills
needed to apply these methods in practice.
This book spans 18 chapters and covers multiple topics. The first two
chapters examine why migration from Pandas and Scikit-Learn to PySpark
can be a seamless process, and address the challenges of selecting an
algorithm. Chapters 3–6 build, train, and evaluate some popular regression
models, namely, multiple linear regression, decision trees, random forests,
and gradient-boosted trees, and use them to deal with some real-world tasks
such as predicting house prices. Chapters 7–12 deal with classification
issues by building, training, and evaluating widely used algorithms such as
logistic regression, decision trees, random forests, support vector machines,
Naive Bayes, and neural networks. In Chapters 13–15, we examine three
additional types of algorithms, namely, recommender systems, natural
language processing, and clustering with k-means. In the final three
chapters, we deal with hyperparameter tuning, pipelines, and deploying
models into production.
CHAPTER 1
An Easy Transition
One of the key factors in making the transition from Pandas and Scikit-
Learn to PySpark relatively easy is the similarity in functionality. This
similarity will become evident after reading this chapter and executing the
code described herein.
One of the easiest ways to test the code is by signing up for an online
Databricks Community Edition account and creating a workspace.
Databricks provides detailed documentation on how to create a cluster,
upload data, and create a notebook.
Additionally, PySpark can be installed locally through the pip install
pyspark command.
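Once PySpark is installed, a session can be started with a couple of lines like the following (the application name here is only an illustration):
[In]: from pyspark.sql import SparkSession
[In]: spark = SparkSession.builder.appName("LocalTest").getOrCreate()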
Pandas and Scikit-Learn are tools primarily designed for small,
in-memory data processing and analysis, whereas PySpark is a big data
processing framework. Even so, PySpark offers functionality similar to
Pandas and Scikit-Learn, including DataFrame operations and machine
learning algorithms. The presence of these familiar functionalities in
PySpark facilitates a smoother transition for data scientists accustomed
to working with Pandas and Scikit-Learn.
In this chapter, we examine in greater depth the factors that contribute to the
ease of transition from these small data tools (Pandas and Scikit-Learn) to
PySpark. More specifically, we focus on PySpark and Pandas integration
and the similarity in syntax between PySpark, on the one hand, and Pandas
and Scikit-Learn, on the other.
PySpark is well integrated with Pandas. Both the DataFrame and MLlib
(Machine Learning Library) designs in PySpark were inspired by Pandas
(and Scikit-Learn) concepts. The DataFrame concept in PySpark is similar
to that of Pandas in that the data is stored in rows and columns similar to
relational database tables and Excel sheets.
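As an illustration, the two small DataFrames displayed below can be created roughly as follows (the variable names pyspark_df and pandas_df match the outputs that follow; the exact construction used in the original listing is an assumption):
[In]: import pandas as pd
[In]: from pyspark.sql import SparkSession
[In]: spark = SparkSession.builder.getOrCreate()
[In]: rows = [("United States", "Mississippi"), ("Brazil", "Amazon"), ("Russia", "Volga")]
[In]: pyspark_df = spark.createDataFrame(rows, ["Country", "River"])
[In]: pandas_df = pd.DataFrame(rows, columns=["Country", "River"])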
[In]: pyspark_df.show()
[Out]:
+-------------+-----------+
|      Country|      River|
+-------------+-----------+
|United States|Mississippi|
|       Brazil|     Amazon|
|       Russia|      Volga|
+-------------+-----------+
The same data can also be inspected as a Pandas DataFrame within the same Spark application:
[In]: print(pandas_df)
[Out]:
         Country        River
0  United States  Mississippi
1         Brazil       Amazon
2         Russia        Volga
Similarity in Syntax
Another key factor that contributes to the smooth transition from small data
tools (Pandas and Scikit-Learn) to big data with PySpark is the familiarity
of syntax. PySpark shares a similar syntax with Pandas and Scikit-Learn in
many cases. For example, square brackets ([]) can be used on Databricks or
Google Colab to select columns directly from a PySpark DataFrame, just
like in Pandas. Similarly, PySpark provides a DataFrame API that
resembles Pandas, and Spark MLlib (Machine Learning Library) includes
implementations of various machine learning algorithms found in Scikit-
Learn.
Loading Data
Let’s start with loading and reading data. We can use the read_csv()
function in Pandas to load a CSV file called data.csv:
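In code, this step presumably looks roughly like the following (the file name data.csv comes from the text; reading options are left at their defaults):
[In]: import pandas as pd
[In]: pandas_df = pd.read_csv("data.csv")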
The first line imports the Pandas library and renames it as pd for
convenience. The second line uses the read_csv() function to read the CSV
file named data.csv. The contents of the file are stored in a DataFrame
called pandas_df.
In PySpark, we can use the spark.read.csv() method to do the same:
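A sketch of the PySpark equivalent, with the header and inferSchema options mentioned below set explicitly (this assumes a SparkSession named spark already exists):
[In]: pyspark_df = spark.read.csv("data.csv", header=True, inferSchema=True)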
We can see from the preceding examples that despite some code differences
(e.g., PySpark requires us to specify the header and inferSchema options
explicitly, while in Pandas, these options have default values that are used if
we do not specify them), Pandas and PySpark read data in a fairly similar
fashion.
Selecting Columns
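The Pandas code this section walks through presumably looks roughly like the following (variable names are illustrative; the values match the printed output below):
[In]: import pandas as pd
[In]: pandas_df = pd.DataFrame({
          "President": ["George Washington", "Abraham Lincoln",
                        "Franklin D. Roosevelt", "John F. Kennedy", "Barack Obama"],
          "Year of Office": [1789, 1861, 1933, 1961, 2009]})
[In]: selected_pandas_df = pandas_df[["President", "Year of Office"]]
[In]: print(selected_pandas_df)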
[Out]:
               President  Year of Office
0      George Washington            1789
1        Abraham Lincoln            1861
2  Franklin D. Roosevelt            1933
3        John F. Kennedy            1961
4           Barack Obama            2009
The first line of code imports the Pandas library and renames it as pd for
convenience. The second line creates a new Pandas DataFrame called
pandas_df. The DataFrame is initialized with a dictionary where the keys
represent the column names and the values represent the data for each
column.
In this example, there are two columns named President and Year of Office,
and each column has five rows of data. The third line of code selects
specific columns from the DataFrame. The double square brackets notation
pandas_df[["President", "Year of Office"]] is used to select both columns.
The resulting DataFrame contains these selected columns.
+---------------------+--------------+
|            President|Year of Office|
+---------------------+--------------+
|    George Washington|          1789|
|      Abraham Lincoln|          1861|
|Franklin D. Roosevelt|          1933|
|      John F. Kennedy|          1961|
|         Barack Obama|          2009|
+---------------------+--------------+
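The PySpark version described next presumably looks roughly like this sketch (the application name Presidents follows the explanation below):
[In]: from pyspark.sql import SparkSession
[In]: spark = SparkSession.builder.appName("Presidents").getOrCreate()
[In]: spark_df = spark.createDataFrame(
          [("George Washington", 1789), ("Abraham Lincoln", 1861),
           ("Franklin D. Roosevelt", 1933), ("John F. Kennedy", 1961),
           ("Barack Obama", 2009)],
          ["President", "Year of Office"])
[In]: spark_df[["President", "Year of Office"]].show()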
The code first imports the SparkSession class and then creates a new Spark
Session named spark with the application name Presidents. Next, it creates
a new PySpark DataFrame called spark_df. This is initialized with a list of
tuples where each tuple represents a row of data, and the column names are
specified as a separate list. In this example, there are two columns named
President and Year of Office, and each column has five rows of data. The
code then selects the columns from the DataFrame using double square
brackets notation. The resulting DataFrame contains the selected columns.
The last line of code displays the resulting DataFrame using the show()
method.
Note: the show() method is needed to display the contents of a PySpark
DataFrame, while Pandas does not have a similar method, as the results are
displayed immediately. This is because PySpark uses lazy evaluation to
optimize data processing by delaying computation until necessary, while
Pandas uses eager evaluation.
In PySpark, we have the option to select specific columns using the select()
method. On the other hand, in Pandas, we can achieve the same result using
the filter() function, as demonstrated here:
In PySpark:
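A sketch of the select() call (spark_df is the Presidents DataFrame created above):
[In]: spark_df.select("President", "Year of Office").show()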
[Out]:
+---------------------+--------------+
|            President|Year of Office|
+---------------------+--------------+
|    George Washington|          1789|
|      Abraham Lincoln|          1861|
|Franklin D. Roosevelt|          1933|
|      John F. Kennedy|          1961|
|         Barack Obama|          2009|
+---------------------+--------------+
In Pandas:
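A sketch of the Pandas filter() call that selects the same columns:
[In]: print(pandas_df.filter(items=["President", "Year of Office"]))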
[Out]:
               President  Year of Office
0      George Washington            1789
1        Abraham Lincoln            1861
2  Franklin D. Roosevelt            1933
3        John F. Kennedy            1961
4           Barack Obama            2009
While the specific operations performed by select() and filter() methods are
different, the resulting objects share the common trait of containing a subset
of the original DataFrame’s data.
Aggregating Data
Not only can Pandas and PySpark read data and select columns in the
same way, they can also aggregate data in a similar fashion.
Step 4: Aggregate by movie using the groupby() method, and then calculate
the mean revenue for each movie using the agg() method with a dictionary
specifying that we want to calculate the mean for the revenue column
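A sketch of this aggregation step (pandas_df is assumed to already hold one row per movie screening with movie and revenue columns):
[In]: agg_pandas_df = pandas_df.groupby("movie", as_index=False).agg({"revenue": "mean"})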
[In]: print(agg_pandas_df)
[Out]:
          movie    revenue
0      Avengers   90000000
1        Frozen   70000000
2  Harry Potter  100000000
3     Star Wars  110000000
4           ...   80000000
[In]: spark = SparkSession.builder.appName("MovieRevenueAggregation").getOrCreate()
[In]: spark_df = spark.createDataFrame(spark_data, ['movie', 'revenue'])
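The aggregation itself presumably looks roughly like the following (the alias is an assumption made here so the printed column name matches the output below):
[In]: from pyspark.sql.functions import avg
[In]: agg_spark_df = spark_df.groupBy("movie").agg(avg("revenue").alias("revenue"))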
[In]: agg_spark_df.show()
[Out]:
+------------+---------+
|       movie|  revenue|
+------------+---------+
|    Avengers| 90000000|
|      Frozen| 70000000|
|Harry Potter|100000000|
|   Star Wars|110000000|
|         ...| 80000000|
+------------+---------+
Filtering Data
Let’s start with Pandas. We create a DataFrame called pandas_df with four
rows of data and two columns, animal and muscle_power, and then filter it
to show only the rows where the muscle_power is greater than 350:
Step 3: Filter the DataFrame to show only rows where the muscle_power is
greater than 350:
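A sketch of this filtering step (pandas_df is assumed to already hold the four animal rows described above):
[In]: filtered_pandas_df = pandas_df[pandas_df["muscle_power"] > 350]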
[In]: print(filtered_pandas_df)
[Out]:
     animal  muscle_power
       Lion           400
   Elephant          6000
    Gorilla           800
Step 1: Create a SparkSession:
[In]: spark = SparkSession.builder.getOrCreate()
Step 2: Create a list of tuples called spark_data, with each tuple containing
an animal and its muscle_power.
Step 3: Create the PySpark DataFrame from the list:
[In]: spark_df = spark.createDataFrame(spark_data, ["animal", "muscle_power"])
Step 4: Filter the DataFrame to show only rows where the muscle_power is
greater than 350:
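A sketch of the PySpark filter, using the square-bracket notation discussed a little further below:
[In]: filtered_spark_df = spark_df[spark_df["muscle_power"] > 350]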
[In]: filtered_spark_df.show()
[Out]:
+--------+------------+
|  animal|muscle_power|
+--------+------------+
|    Lion|         400|
|Elephant|        6000|
| Gorilla|         800|
+--------+------------+
We can also use the query() and filter() methods in Pandas and PySpark,
respectively, to show only rows where muscle_power is greater than 350.
These are equivalent to square brackets notation,
pandas_df[pandas_df['muscle_power'] > 350] and
spark_df[spark_df['muscle_power'] > 350].
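A sketch of the two calls this paragraph refers to:
[In]: print(pandas_df.query("muscle_power > 350"))
[In]: spark_df.filter(spark_df["muscle_power"] > 350).show()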
In Pandas:
[Out]:
     animal  muscle_power
       Lion           400
   Elephant          6000
    Gorilla           800
In PySpark:
[Out]:
+--------+------------+
|  animal|muscle_power|
+--------+------------+
|    Lion|         400|
|Elephant|        6000|
| Gorilla|         800|
+--------+------------+
There are at least two more ways to achieve the same filtering result in
Spark DataFrame:
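Which two approaches the original used is not certain; two common possibilities that produce the same result are the where() method and running a SQL query over a temporary view:
[In]: from pyspark.sql.functions import col
[In]: spark_df.where(col("muscle_power") > 350).show()
[In]: spark_df.createOrReplaceTempView("animals")
[In]: spark.sql("SELECT * FROM animals WHERE muscle_power > 350").show()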
[Out]:
+--------+------------+
|  animal|muscle_power|
+--------+------------+
|    Lion|         400|
|Elephant|        6000|
| Gorilla|         800|
+--------+------------+
[Out]:
+--------+------------+
|  animal|muscle_power|
+--------+------------+
|    Lion|         400|
|Elephant|        6000|
| Gorilla|         800|
+--------+------------+
The preceding examples demonstrate that both Pandas and PySpark provide
various methods for data filtering and, to a significant extent, there are
similarities in their respective approaches.
Joining Data
[In]: pandas_df1 = pd.DataFrame({
          'Job Title': ['Software Engineer', 'Data Analyst', 'Project Manager'],
          'Salary': [100000, 75000, 90000],
          'Location': ['San Francisco', 'New York', 'Seattle']})
[In]: print(pandas_df1)
[Out]:
           Job Title  Salary       Location
0  Software Engineer  100000  San Francisco
1       Data Analyst   75000       New York
2    Project Manager   90000        Seattle
[In]: pandas_df2 = pd.DataFrame({
          'Job Title': ['Software Engineer', 'Data Scientist', 'Project Manager'],
          'Salary': [120000, 95000, 90000],
          'Location': ['Los Angeles', 'Chicago', 'Boston']})
[In]: print(pandas_df2)
[Out]:
           Job Title  Salary     Location
0  Software Engineer  120000  Los Angeles
1     Data Scientist   95000      Chicago
2    Project Manager   90000       Boston
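The merge itself presumably joins on the shared Job Title column, along these lines:
[In]: merged_pandas_df = pd.merge(pandas_df1, pandas_df2, on='Job Title', how='inner')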
[In]: print(merged_pandas_df)
[Out]:
           Job Title  Salary_x     Location_x  Salary_y   Location_y
0  Software Engineer    100000  San Francisco    120000  Los Angeles
1    Project Manager     90000        Seattle     90000       Boston
[In]: spark_df1 = spark.createDataFrame(
          [('Software Engineer', 100000, 'San Francisco'),
           ('Data Analyst', 75000, 'New York'),
           ('Project Manager', 90000, 'Seattle')],
          ['Job Title', 'Salary', 'Location'])
[In]: spark_df1.show()
[Out]:
+-----------------+------+-------------+
|        Job Title|Salary|     Location|
+-----------------+------+-------------+
|Software Engineer|100000|San Francisco|
|     Data Analyst| 75000|     New York|
|  Project Manager| 90000|      Seattle|
+-----------------+------+-------------+
[In]: spark_df2 = spark.createDataFrame(
          [('Software Engineer', 120000, 'Los Angeles'),
           ('Data Scientist', 95000, 'Chicago'),
           ('Project Manager', 90000, 'Boston')],
          ['Job Title', 'Salary', 'Location'])
[In]: spark_df2.show()
[Out]:
+-----------------+------+-----------+
|        Job Title|Salary|   Location|
+-----------------+------+-----------+
|Software Engineer|120000|Los Angeles|
|   Data Scientist| 95000|    Chicago|
|  Project Manager| 90000|     Boston|
+-----------------+------+-----------+
[In]: joined_spark_df = spark_df1.join(spark_df2, on='Job Title',
how='inner')
[In]: joined_spark_df.show()
[Out]:
+-----------------+------+-------------+------+-----------+
|        Job Title|Salary|     Location|Salary|   Location|
+-----------------+------+-------------+------+-----------+
|Software Engineer|100000|San Francisco|120000|Los Angeles|
|  Project Manager| 90000|      Seattle| 90000|     Boston|
+-----------------+------+-------------+------+-----------+
Notice that in Pandas, the suffixes _x and _y are automatically added to the
column labels when there are overlapping column names in the two
DataFrames being merged.
This helps to differentiate between the columns with the same name in the
merged DataFrame. In the preceding example, both pandas_df1 and
pandas_df2 have a column named Salary and a column named Location.
When performing the merge, Pandas automatically appends the suffixes _x
and _y to the column labels to distinguish them.
This, however, is not the case in PySpark. It does not automatically append
suffixes like _x and _y to the column labels when there are overlapping
column names in the DataFrames being joined.
Saving Data
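A minimal sketch of saving a DataFrame to CSV in each library (the file and directory names here are illustrative):
[In]: pandas_df.to_csv("data_out.csv", index=False)
[In]: spark_df.write.csv("data_out_csv", header=True)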
Note: when writing to a CSV file in PySpark, the data is split into multiple CSV
files and written to a directory. The output name specified is the name of
the directory, not of a single file.
Modeling Steps
1. Data preparation: in Scikit-Learn, the data is split into training and
testing sets (for example, with train_test_split()); in PySpark, the
randomSplit() method plays the same role, and the feature columns are
combined into a single vector column with VectorAssembler.
2. Model training: in Scikit-Learn, the fit() method takes the training
features and target as input; in PySpark, fit() is called on a training
DataFrame that contains the label column and the assembled feature
vector column.
3. Model evaluation: Scikit-Learn provides metric functions that compare
predicted and actual values, while PySpark relies on evaluator classes
that operate on the label and prediction columns of a DataFrame.
4. Prediction: in Scikit-Learn, the predict() method generates
predictions for new data, where you provide the features (X_test);
in PySpark, the transform() method is used, and the same assembler
must be applied before making predictions to ensure the new data has
the same feature vector structure.
Pipelines
Pipelines are another area where the two libraries mirror each other.
In Scikit-Learn, a pipeline is built with the Pipeline class from the
sklearn.pipeline package, and each step is a transformer or estimator.
PySpark provides a Pipeline class in its pyspark.ml package that works
in much the same way as Scikit-Learn: each stage is a transformer or
estimator, and the stages run in order. In both libraries, the pipeline
is trained by calling the fit method on the pipeline object and providing
the training data; each stage processes the data and passes it to the next
stage until the final model training. Making predictions then works the
same way as with a single model.
The following defines the steps of a simple Scikit-Learn pipeline:
[In]: steps = [
('preprocessing', PreprocessingStep()),
('feature_selection', FeatureSelectionStep()),
('regressor', RegressionModel())
]
The PySpark equivalent defines a list of stages:
[In]: stages = [
PreprocessingStep(),
FeatureSelectionStep(),
ModelStep()
]
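A sketch of how the two pipelines could then be built and fitted (X_train, y_train, and train_data are assumed training inputs; the stage classes above are placeholders):
[In]: from sklearn.pipeline import Pipeline
[In]: sklearn_pipeline = Pipeline(steps=steps)
[In]: sklearn_pipeline.fit(X_train, y_train)
[In]: from pyspark.ml import Pipeline as SparkPipeline
[In]: spark_pipeline = SparkPipeline(stages=stages)
[In]: spark_model = spark_pipeline.fit(train_data)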
Summary
In the next chapter, we embark on our journey into machine learning, which
forms the core of this book. We will introduce k-fold cross-validation, a
technique that helps us select the best-performing model from a range of
different algorithms. Using this method, we can short-list the top-
performing models for further evaluation and testing.
CHAPTER 2
Selecting Algorithms
Before building our model, let’s first explore the dataset using Pandas and
PySpark.
This will allow us to see how these two libraries utilize similar approaches
to reading and processing data.
The Dataset
The dataset used for this project is known as the Pima Indians Diabetes
Database. Pima refers to the Pima people, who are a group of Native
American tribes indigenous to the Southwestern United States, primarily
living in what is now Arizona. The Pima people have a unique cultural
heritage and history.
The dataset was originally created by the National Institute of Diabetes and
Digestive and Kidney Diseases. For the purpose of this analysis, we sourced
it from Kaggle—a website designed for data scientists. The following are
the contributor’s name, approximate upload date, dataset name, site name,
and the URL from which we downloaded a copy of the CSV file:
Source: Kaggle
URL: www.kaggle.com/uciml/pima-indians-diabetes-database
Date: 2016
[In]: import pandas as pd
[In]: url = ('https://raw.githubusercontent.com/abdelaziztestas/'
             'spark_book/main/diabetes.csv')
[In]: pandas_df = pd.read_csv(url)
In the next step, we take a look at the top five rows using the Pandas head()
method:
[In]: pandas_df.head()
[Out]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
In PySpark, the show() method displays the same rows:
[In]: spark_df.show(5)
[Out]:
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
|          0|    137|           40|           35|    168|43.1|                   2.288| 33|      1|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
We can see from the output of both Pandas and PySpark that there are nine
columns (eight attributes and one outcome variable). Pregnancies refer to
the count of pregnancies a woman has experienced. Glucose pertains to the
concentration of glucose in the bloodstream two hours after an oral glucose
tolerance test. BloodPressure signifies the diastolic blood pressure,
measured in millimeters of mercury (mm Hg).
[In]: pandas_df.info()
[Out]:
<class 'pandas.core.frame.DataFrame'>
The info() output shows that the DataFrame contains 768
instances, all of which are numerical and none of which are null. The data
types are either int64 (integer) or float64 (decimal).
[In]: spark_df.printSchema()
[Out]:
root
[In]: spark_df.count()
[Out]: 768
[Out]:
       Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin     BMI  DiabetesPedigreeFunction     Age
count       768.00   768.00         768.00         768.00   768.00  768.00                    768.00  768.00
mean          3.85   120.89          69.11          20.54    79.80   31.99                      0.47   33.24
std           3.37    31.97          19.36          15.95   115.24    7.88                      0.33   11.76
min           0.00     0.00           0.00           0.00     0.00    0.00                      0.08   21.00
25%           1.00    99.00          62.00           0.00     0.00   27.30                      0.24   24.00
50%           3.00   117.00          72.00          23.00    30.50   32.00                      0.37   29.00
75%           6.00   140.25          80.00          32.00   127.25   36.60                      0.63   41.00
max          17.00   199.00         122.00          99.00   846.00   67.10                      2.42   81.00
In PySpark, we can obtain the same results using the summary() method,
which would give us the same summary statistics, including the 25%, 50%,
and 75% percentiles.
[In]: spark_df.drop('Outcome').summary().show()
[Out]:
+-------+-----------+-------+-------------+-----+------------------------+-----+
|summary|Pregnancies|Glucose|BloodPressure|  BMI|DiabetesPedigreeFunction|  Age|
+-------+-----------+-------+-------------+-----+------------------------+-----+
|  count|        724|    724|          724|  724|                     724|  724|
|   mean|       3.87| 121.88|        72.40|32.47|                    0.47|33.35|
| stddev|       3.36|  30.75|        12.38| 6.89|                    0.33|11.77|
|    min|           |     44|           24| 18.2|                   0.078|   21|
|    25%|           |     99|           64| 27.5|                   0.245|   24|
|    50%|          3|    117|           72| 32.4|                   0.378|   29|
|    75%|           |    142|           80| 36.6|                   0.627|   41|
|    max|         17|    199|          122| 67.1|                    2.42|   81|
+-------+-----------+-------+-------------+-----+------------------------+-----+
Turning now to the Outcome variable, we can find out how many women
have diabetes and how many do not have diabetes using the value_counts()
method in Pandas:
[In]: pandas_df["Outcome"].value_counts()
[Out]:
0 500
1 268
In PySpark, we can use the groupBy() and count() methods to get the same
results:
[In]: spark_df.groupBy("Outcome").count().show()
[Out]:
+-------+-----+
|Outcome|count|
+-------+-----+
| 0| 500|
| 1| 268|
+-------+-----+
Both Pandas and PySpark return 268 women for category 1 (i.e., woman
has diabetes) and 500 women for category 0 (i.e., woman does not have
diabetes). This, as expected, adds up to the total of 768 observations.
In the next step, we will check if there are missing values. This is important
because machine learning algorithms generally require complete data sets to
be trained effectively. This is because the algorithms use statistical methods
to find patterns in the data, and missing data can disrupt the accuracy of
those patterns.
We saw earlier, using the info() and printSchema() methods, that there are
no null values. We can replicate this by counting the number of null values
in each column of the pandas_df using the Pandas isnull() and sum()
methods:
[In]: pandas_df.isnull().sum()
[Out]:
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
In PySpark, we can also use the isNull() and sum() methods but in a slightly
different way. More precisely, we use
sum(col(c).isNull().cast("int")).alias(c) to generate a PySpark expression
that will count the number of null values in a single column c of the
dataframe (c is a variable that is used in a list comprehension to generate a
list of expressions, where each expression calculates the number of null
values in the corresponding column and gives it an alias with the same
column name):
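Putting that expression into a select() call gives a sketch like the following:
[In]: from pyspark.sql.functions import col, sum
[In]: spark_df.select([sum(col(c).isNull().cast("int")).alias(c)
                       for c in spark_df.columns]).show()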
[Out]:
+-----------+-------+-------------+-------------+-------+---+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin|BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+---+------------------------+---+-------+
|          0|      0|            0|            0|      0|  0|                       0|  0|      0|
+-----------+-------+-------------+-------------+-------+---+------------------------+---+-------+
Both Pandas and PySpark outputs suggest that there are no null values in
the dataset.
However, having 0 nulls does not always indicate that there are no missing
values in a dataset. This is because in some cases, missing values are
replaced with 0s.
Let’s use the following Pandas code to calculate the total number of 0
values in each of the nine columns, including the Outcome variable:
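One simple way to do this (whether the original used exactly this expression is an assumption) is to compare the whole DataFrame with 0 and sum the result column by column:
[In]: (pandas_df == 0).sum()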
[Out]:
Pregnancies 111
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
DiabetesPedigreeFunction 0
Age 0
Outcome 500
dtype: int64
In PySpark, we first import the sum() and col() functions and then generate
expressions that count the number of zeros in each column:
[In]: from pyspark.sql.functions import sum, col
[In]: exprs = [sum((col(c) == 0).cast("int")).alias(c) for c in spark_df.columns]
Next, we apply the expressions to the dataframe and show the result:
[In]: spark_df.select(exprs).show()
[Out]:
+-----------+-------+-------------+-------------+-------+---+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin|BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+---+------------------------+---+-------+
|        111|      5|           35|          227|    374| 11|                       0|  0|    500|
+-----------+-------+-------------+-------------+-------+---+------------------------+---+-------+
Pandas and PySpark both indicate that there are no 0 cases for Diabetes
Pedigree Function and Age, but there are many cases with 0 values for the
other variables. For example, 111 cases have 0 values for the Pregnancy
variable, which makes sense since some females in the sample have no
kids. Additionally, 500 cases have 0 values for the Outcome variable,
indicating that these women do not have diabetes. However, it is illogical
for Glucose, Blood Pressure, Skin Thickness, Insulin, or BMI to have 0
values.
There are three common ways to deal with invalid readings: exclude
columns or features with 0 values, exclude rows with 0 values, or impute 0
values with mean or average values. For this chapter, we have chosen
options 1 and 2: the SkinThickness and Insulin columns are excluded
because the number of rows with 0 values is too large (227 and 374,
respectively), and the remaining rows with invalid 0 readings are excluded.
The filtering conditions include clauses such as the following.
In Pandas:
& (pandas_df['BloodPressure'] != 0)
In PySpark:
& (col('BloodPressure') != 0)
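A sketch of the full row filter, assuming the zero checks are applied to the three retained measurement columns (Glucose, BloodPressure, and BMI):
[In]: pandas_df = pandas_df[(pandas_df['Glucose'] != 0)
                            & (pandas_df['BloodPressure'] != 0)
                            & (pandas_df['BMI'] != 0)]
[In]: from pyspark.sql.functions import col
[In]: spark_df = spark_df.filter((col('Glucose') != 0)
                                 & (col('BloodPressure') != 0)
                                 & (col('BMI') != 0))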
We end up with 7 columns (6 features and 1 class variable) and 724 rows.
We can confirm these counts by using the shape attribute in Pandas:
[In]: print(pandas_df.shape)
[Out]: (724, 7)
PySpark doesn’t have the shape attribute, but we can accomplish the same
task by combining the count() and len() functions:
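A sketch of that combination:
[In]: print((spark_df.count(), len(spark_df.columns)))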
[Out]: (724, 7)
Selecting Algorithms with Cross-Validation
In k-fold cross-validation, the data is split into k folds (here, ten); the
model is trained on k-1 folds, evaluated on the remaining fold, and the
process is repeated so that each fold serves once as the validation set.
This method is superior to training and testing the model on just one dataset
split for a number of reasons: by repeatedly training and evaluating the model
on different folds, we can get a better sense of how well the model generalizes
to unseen data, and by splitting the data into multiple folds, we can use more
data for both training and validation.
standardize the features using the StandardScaler function. Standard
scaling, also known as z-score normalization or standardization, is a data
preprocessing technique used to transform numerical features to have a
mean of 0 and a standard deviation of 1. It involves subtracting the mean of
each feature from the data and then dividing by the standard deviation.
That is, z = (x - mean(x)) / std(x), where mean(x) is the mean of feature x and std(x) is its standard deviation.
There are other differences between Scikit-Learn and PySpark. The first is
that Scikit-Learn separates cross-validation from hyperparameter tuning
while PySpark combines them by default. The second is that PySpark
combines the features in a single vector while Scikit-Learn does not.
Scikit-Learn
[In]: url = ('https://raw.githubusercontent.com/abdelaziztestas/'
             'spark_book/main/diabetes.csv')
[In]: pandas_df = pd.read_csv(url)
[In]: pandas_df = pandas_df[(pandas_df['Glucose'] != 0)
                            & (pandas_df['BloodPressure'] != 0)
                            & (pandas_df['BMI'] != 0)]
[In]: pandas_df = pandas_df[['Pregnancies',
                             'Glucose',
                             'BloodPressure',
                             'BMI',
                             'DiabetesPedigreeFunction',
                             'Age',
                             'Outcome']]
After generating this DataFrame that filters out invalid diabetes readings,
the first step in the k-fold cross-validation process is to import the required
libraries:
• LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, and LinearSVC for building the classification models, StandardScaler for standardizing the features, and cross_val_score for performing the cross-validation
Step 2: Define the feature and target columns. Assign the feature columns
to the variable x and the target column to the variable y:
[In]: x_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'BMI',
'DiabetesPedigreeFunction', 'Age']
[In]: x = pandas_df[x_cols]
[In]: y = pandas_df["Outcome"]
[In]: classifiers = [LogisticRegression(), DecisionTreeClassifier(),
                     RandomForestClassifier(), LinearSVC(max_iter=1500)]
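The x_scaled array used in the loop below presumably comes from a standard-scaling step along these lines:
[In]: from sklearn.preprocessing import StandardScaler
[In]: x_scaled = StandardScaler().fit_transform(x)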
[In]: for classifier in classifiers:
          result = cross_val_score(
              classifier,
              x_scaled,
              y,
              cv=10,
              scoring="roc_auc")
          print(f"{classifier.__class__.__name__} mean: {result.mean():.2f}")
The printed output shows one line per classifier; each value indicates the
average ROC AUC score across the ten cross-validation folds. The ROC AUC
is the area under the receiver operating characteristic (ROC) curve.
The curve is created by plotting the true positive rate (TPR) against the
false positive rate (FPR) at various threshold values.
The ROC curve is generated through the plotting of TPR (sensitivity) along
the vertical y axis and FPR (1-specificity) along the horizontal x axis. This
curve visually portrays how alterations in the discrimination threshold
impact the classifier’s efficacy.
In an ideal scenario, the ROC curve for a classifier would ascend directly
from the origin (FPR = 0) and continue horizontally to the upper-right
corner (TPR = 1), signifying a perfect differentiation between the two
classes.
In real-world scenarios, most classifiers yield curves that curve toward the
upper-left corner. The proximity of the ROC curve to this corner is directly
proportional to the classifier’s performance quality—closer alignment
indicating superior performance. The area beneath the ROC curve (AUC)
serves as a single metric encapsulating the classifier’s performance across
all conceivable threshold values. An AUC value of 1 signifies a perfect
classifier, while an AUC value of 0.5 implies a classifier performing no
better than random chance.
Let’s go ahead and plot the ROC curves of the decision tree and random
forest classifiers to better understand what they are. We can achieve this
with the following code, using the same pandas_df DataFrame we
generated earlier to train and evaluate the models:
[In]: decision_tree_model = DecisionTreeClassifier()
[In]: random_forest_model = RandomForestClassifier()
[In]: decision_tree = cross_val_predict(
          decision_tree_model,
          x,
          y,
          cv=10,
          method="predict_proba")
[In]: random_forest = cross_val_predict(
          random_forest_model,
          x,
          y,
          cv=10,
          method="predict_proba")
[In]: fpr_decision_tree, tpr_decision_tree, _ = roc_curve(y, decision_tree[:, 1])
[In]: fpr_random_forest, tpr_random_forest, _ = roc_curve(y, random_forest[:, 1])
[In]: roc_auc_decision_tree = auc(fpr_decision_tree, tpr_decision_tree)
[In]: roc_auc_random_forest = auc(fpr_random_forest, tpr_random_forest)
[In]: plt.figure()
[In]: plt.plot(fpr_decision_tree, tpr_decision_tree, color='darkorange', lw=2,
          label='Decision Tree (area = %0.2f)' % roc_auc_decision_tree)
[In]: plt.plot(fpr_random_forest, tpr_random_forest, lw=2,
          label='Random Forest (area = %0.2f)' % roc_auc_random_forest)
[In]: plt.legend(loc='lower right')
[In]: plt.show()
[Out]:
The ROC curve for the random forest classifier with an ROC AUC of 0.83
is positioned closer to the top-left corner of the plot than the decision tree
classifier. This suggests that the random forest classifier has a better ability
to correctly classify positive and negative instances, with a more moderate
false positive rate.
PySpark
[In]: url = ('https://raw.githubusercontent.com/abdelaziztestas/'
             'spark_book/main/diabetes.csv')
[In]: pandas_df = pd.read_csv(url)
[In]: spark_df = spark.createDataFrame(pandas_df)
[In]: data = (spark_df
          .filter((col('Glucose') != 0)
                  & (col('BloodPressure') != 0)
                  & (col('BMI') != 0))
          .select(['Pregnancies',
                   'Glucose',
                   'BloodPressure',
                   'BMI',
                   'DiabetesPedigreeFunction',
                   'Age',
                   'Outcome']))
In a nutshell, our first step involves reading the CSV file into a Pandas
DataFrame, turning it into a Spark DataFrame, and filtering out the invalid
readings. We then import the classes needed for modeling:
• LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, and LinearSVC for building the classification models
• VectorAssembler for combining the features into a single vector column
• CrossValidator, ParamGridBuilder, and an evaluator for cross-validating the models and measuring their performance
Step 2: Combine the feature columns into a single vector column named
features using VectorAssembler:
[In]: assembler = VectorAssembler(inputCols=[
          'Pregnancies',
          'Glucose',
          'BloodPressure',
          'BMI',
          'DiabetesPedigreeFunction',
          'Age'], outputCol="features")
[In]: data = assembler.transform(data)
To see what the VectorAssembler does, let's print the top five rows of the "data"
DataFrame (the output is truncated to keep the feature vector readable):
[Out]:
+-----------+-------+-------------+----+------------------------+---+-------------------------+
|Pregnancies|Glucose|BloodPressure| BMI|DiabetesPedigreeFunction|Age|                 features|
+-----------+-------+-------------+----+------------------------+---+-------------------------+
|          6|    148|           72|33.6|                   0.627| 50| [6,148,72,33.6,0.627,50]|
|          1|     85|           66|26.6|                   0.351| 31|  [1,85,66,26.6,0.351,31]|
|          8|    183|           64|23.3|                   0.672| 32| [8,183,64,23.3,0.672,32]|
|          1|     89|           66|28.1|                   0.167| 21|  [1,89,66,28.1,0.167,21]|
|          0|    137|           40|43.1|                   2.288| 33| [0,137,40,43.1,2.288,33]|
+-----------+-------+-------------+----+------------------------+---+-------------------------+
We can see from the preceding post-vector assembling output that the
features have been combined into a single vector named features. The first
row, for example, combines the individual values of pregnancies, glucose,
blood pressure, BMI, diabetes pedigree function, and age into the numerical
vector [6,148,72,33.6,0.627,50].
Notice that the label column (Outcome) is not included in this assembly
process, as vector assembling is exclusively meant for the features,
excluding the target variable.
features")
While decision trees are not directly affected by feature scaling, ensemble
methods like random forests involve aggregating multiple decision trees.
Scaling can help improve the stability and performance of these ensemble
methods by ensuring that the tree-building process is not influenced by
varying scales of features. This can lead to more accurate predictions and
better generalization.
[In]: classifiers = [
          LogisticRegression(),
          DecisionTreeClassifier(),
          RandomForestClassifier(),
          LinearSVC(maxIter=1500)]
• Transform the data with the model and evaluate the performance
• Print the name of the classifier and its mean accuracy as measured by the
ROC AUC.
[In]: evaluator = BinaryClassificationEvaluator()
[In]: for classifier in classifiers:
          paramGrid = ParamGridBuilder().build()
          cv = CrossValidator(
              estimator=classifier, estimatorParamMaps=paramGrid,
              evaluator=evaluator, numFolds=10)
          cvModel = cv.fit(data)
          results = cvModel.transform(data)
          accuracy = evaluator.evaluate(results)
          print(f"{classifier.__class__.__name__} mean: {accuracy:.2f}")
[Out]:
The results indicate that random forest had the best performance with an
AUC of 0.91, followed by logistic regression and support vector machine
with an AUC of 0.84
each. The decision tree classifier did not perform as well, with an AUC of
0.78.
These findings are not exactly the same as those from Scikit-Learn as the
two frameworks use different default hyperparameters, which can cause
differences in the accuracy metric (we will explore hyperparameter tuning
in great detail in Chapter 16).
Additionally, the k-fold sampling is random, so each time the k-fold cross-
validation is performed, a different set of data is used for training and
validation, resulting in a different model and evaluation metric.
Bringing It All Together
Scikit-Learn
[In]: url = ('https://raw.githubusercontent.com/abdelaziztestas/'
             'spark_book/main/diabetes.csv')
[In]: pandas_df = pd.read_csv(url)
[In]: pandas_df = pandas_df[(pandas_df['Glucose'] != 0)
                            & (pandas_df['BloodPressure'] != 0)
                            & (pandas_df['BMI'] != 0)]
[In]: pandas_df = pandas_df[['Pregnancies',
                             'Glucose',
                             'BloodPressure',
                             'BMI',
                             'DiabetesPedigreeFunction',
                             'Age',
                             'Outcome']]
[In]: x_cols = ['Pregnancies',
                'Glucose',
                'BloodPressure',
                'BMI',
                'DiabetesPedigreeFunction',
                'Age']
[In]: x = pandas_df[x_cols]
[In]: y = pandas_df["Outcome"]
[In]: x_scaled = StandardScaler().fit_transform(x)
[In]: classifiers = [
          LogisticRegression(),
          DecisionTreeClassifier(),
          RandomForestClassifier(),
          LinearSVC(max_iter=1500)]
# Perform 10-fold cross-validation for each model and print the mean accuracy
[In]: for classifier in classifiers:
          result = cross_val_score(
              classifier,
              x_scaled,
              y,
              cv=10,
              scoring="roc_auc")
          print(f"{classifier.__class__.__name__} mean: {result.mean():.2f}")
PySpark
[In]: url = ('https://raw.githubusercontent.com/abdelaziztestas/'
             'spark_book/main/diabetes.csv')
[In]: pandas_df = pd.read_csv(url)
[In]: spark_df = spark.createDataFrame(pandas_df)
[In]: data = (spark_df
          .filter((col('Glucose') != 0)
                  & (col('BloodPressure') != 0)
                  & (col('BMI') != 0))
          .select(['Pregnancies',
                   'Glucose',
                   'BloodPressure',
                   'BMI',
                   'DiabetesPedigreeFunction',
                   'Age',
                   'Outcome']))
[In]: assembler = VectorAssembler(inputCols=[
          'Pregnancies',
          'Glucose',
          'BloodPressure',
          'BMI',
          'DiabetesPedigreeFunction',
          'Age'], outputCol="features")
[In]: data = assembler.transform(data)
[In]: scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
[In]: data = scaler.fit(data).transform(data)
[In]: classifiers = [
          LogisticRegression(),
          DecisionTreeClassifier(),
          RandomForestClassifier(),
          LinearSVC(maxIter=1500)]
[In]: evaluator = BinaryClassificationEvaluator()
[In]: for classifier in classifiers:
          paramGrid = ParamGridBuilder().build()
          cv = CrossValidator(estimator=classifier,
                              estimatorParamMaps=paramGrid,
                              evaluator=evaluator, numFolds=10)
          cvModel = cv.fit(data)
          results = cvModel.transform(data)
          accuracy = evaluator.evaluate(results)
          print(f"{classifier.__class__.__name__} mean: {accuracy:.2f}")
Summary
In the next chapter, we demonstrate how to build, train, evaluate, and use a
different type of supervised machine learning: multiple linear regression.
We will show that the steps involved in machine learning, including
splitting data, model training, model evaluation, and prediction, are the
same in both Scikit-Learn and PySpark. Furthermore, we will demonstrate
how Pandas and PySpark have similar approaches to data manipulation,
which simplifies tasks like exploring data.
CHAPTER 3
Multiple Linear Regression with Pandas, Scikit-Learn, and PySpark
This chapter demonstrates how to build, train, evaluate, and use a multiple
linear regression model in both Scikit-Learn and PySpark. It shows that the
steps involved in machine learning, including splitting data, model training,
model evaluation, and prediction, are the same in both frameworks.
Furthermore, Pandas and PySpark have similar approaches to data
manipulation, which simplifies tasks like exploring data.
These similarities aid the data scientist in switching from in-memory data
processing tools such as Pandas and Scikit-Learn to PySpark, which is
designed for distributed processing of large datasets across multiple
machines. However, it is important to note that these general similarities do
not override the unique nuances of each tool. Another goal of this chapter is
to explain these differences.
PySpark can be installed in a notebook environment with the
!pip install pyspark command, and PySpark code can also be run locally by
installing PySpark on a local machine using the same command.
The Dataset
The first step in our project is to generate some data to demonstrate how
data exploration can be done using Pandas and PySpark, followed by
machine learning using Scikit-Learn and PySpark for multiple linear
regression.
[In]: X, y = make_regression(n_samples=100, n_features=2, noise=..., random_state=42)
[In]: pandas_df = pd.DataFrame({'Feature 1': X[:, 0], 'Feature 2': X[:, 1],
                                'Target': y})
This code generates regression data with two features and random noise
using the make_regression function from Scikit-Learn. The n_samples
parameter specifies the number of samples, n_features specifies the number
of features, noise controls the amount of random noise added to the data,
and random_state ensures reproducibility.
The generated features are stored in the variable X, and the corresponding
target variable is stored in the variable y. The code converts the NumPy
arrays X and y into a Pandas DataFrame named pandas_df. This is
constructed using a dictionary where the keys are column names (Feature 1,
Feature 2, Target) and the values are extracted from the X and y arrays. X[:,
0] represents the first column of X, X[:, 1] represents the second column,
and y represents the target values. The values from X are assigned to the
respective columns Feature 1 and Feature 2, while the values from y are
assigned to the Target column.
[In]: from pyspark.sql import SparkSession
[In]: spark = SparkSession.builder.appName("PySparkRegressionData").getOrCreate()
[In]: spark_df = spark.createDataFrame(
          [(float(X[i, 0]), float(X[i, 1]), float(y[i])) for i in range(len(X))],
          ["Feature 1", "Feature 2", "Target"])
To work with the data in a PySpark environment, the code converts the NumPy X
and y arrays into a PySpark DataFrame.
The first line imports the SparkSession class, which is the entry point for
working with structured data in Spark and is used to create a DataFrame.
The second line initializes a SparkSession with the name
"PySparkRegressionData". The last line creates the PySpark DataFrame
spark_df with the columns Feature 1, Feature 2, and Target.
Now, let’s begin by inspecting the top five rows of each DataFrame as an
initial step in our exploratory data analysis. The head(n) method in Pandas
and the show(n) method in PySpark are used to display the first n rows of a
DataFrame. By default, n=5 in Pandas and n=20 in PySpark. We have
specified the first five rows to override the default values.
[In]: pandas_df.head()
[Out]:
   Feature 1  Feature 2  Target
0      -1.19       0.66  -49.27
1       0.06      -1.14  -85.16
2       0.59       2.19  211.23
3       0.47      -0.07   29.20
4       0.74       0.17   84.35
In PySpark, using the show() method to display the same five rows:
[In]: spark_df.show(5)
[Out]:
+---------+---------+------+
|Feature 1|Feature 2|Target|
+---------+---------+------+
|    -1.19|     0.66|-49.27|
|     0.06|    -1.14|-85.16|
|     0.59|     2.19|211.23|
|     0.47|    -0.07| 29.20|
|     0.74|     0.17| 84.35|
+---------+---------+------+
We can see that each DataFrame has three columns: Feature 1, Feature 2,
and Target.
We can investigate the data a bit further using Pandas and PySpark to
understand how they compare. We can check the shape of the datasets as
follows:
[In]: print(pandas_df.shape)
[Out]: (100, 3)
In PySpark, combining the count() method with the number of columns gives the same result:
[Out]: (100, 3)
In both Pandas and PySpark, the output is a tuple containing the number of
rows and columns of the DataFrame. We can see that each DataFrame has
100 rows and 3
columns.
We can call the dtypes attribute in both Pandas and PySpark to check the
data type of each column in the dataset:
In Pandas:
[In]: print(pandas_df.dtypes)
[Out]:
Feature 1 float64
Feature 2 float64
Target float64
dtype: object
We can see that all the columns are of the decimal type (float64). In
PySpark, the equivalent data type is double, as shown in the output of the
following code:
[In]: print(spark_df.dtypes)
[Out]: [('Feature 1', 'double'), ('Feature 2', 'double'), ('Target', 'double')]
We can call the columns attribute to return the name of each column in the
DataFrame.
In Pandas:
[In]: print(pandas_df.columns)
[Out]: Index(['Feature 1', 'Feature 2', 'Target'], dtype='object')
The output is an Index object that contains three labels or column names in
the following order: Feature 1, Feature 2, and Target. The Index object is an
important component of a Pandas DataFrame, as it allows to select specific
rows or columns of data using the labels as a reference.
In PySpark:
[In]: print(spark_df.columns)
[Out]: ['Feature 1', 'Feature 2', 'Target']
PySpark does not use the concept of indexing like Pandas because PySpark
is designed to handle distributed computing across multiple machines,
while Pandas is designed for in-memory data processing on a single
machine. In PySpark, the data is typically stored in a distributed manner
across multiple machines, and the computations are executed in parallel on
those machines. This makes the indexing concept less efficient in PySpark,
as indexing can result in a bottleneck when working with large datasets that
are distributed across multiple machines.
[In]: pandas_df.info()
[Out]:
<class 'pandas.core.frame.DataFrame'>
dtypes: float64(3)
In PySpark, the printSchema() method prints the schema of the DataFrame:
[In]: spark_df.printSchema()
[Out]:
root
 |-- Feature 1: double (nullable = true)
 |-- Feature 2: double (nullable = true)
 |-- Target: double (nullable = true)
[In]: pandas_df.describe()
[Out]:
       Feature 1  Feature 2   Target
count     100.00     100.00   100.00
mean       -0.12       0.03    -7.20
std         0.86       1.00   106.64
min        -2.62      -1.99  -201.72
25%        -0.78      -0.71   -84.39
50%        -0.04       0.11   -17.45
75%         0.34       0.66    70.90
max         1.89       2.72   211.23
[In]: spark_df.summary().show()
[Out]:
+-------+---------+---------+-------+
|summary|Feature 1|Feature 2| Target|
+-------+---------+---------+-------+
|  count|   100.00|   100.00| 100.00|
|   mean|    -0.12|     0.03|  -7.20|
| stddev|     0.86|     1.00| 106.64|
|    min|    -2.62|    -1.99|-201.72|
|    25%|    -0.78|    -0.71| -84.39|
|    50%|    -0.04|     0.11| -17.45|
|    75%|     0.34|     0.66|  70.90|
|    max|     1.89|     2.72| 211.23|
+-------+---------+---------+-------+
PySpark also offers the describe() method as in Pandas, but its output only
includes count, mean, std, min, and max values for each numerical column.
The PySpark summary() function, on the other hand, shows all the
summary statistics including the percentiles.
Now that we have generated some data, we can proceed to demonstrate how
to perform multiple linear regression using Scikit-Learn and PySpark. We
will build, train, and evaluate a multiple linear regression model using the
same dataset we generated in the previous section. The goal is to model the
relationship between the predictors (Feature 1, Feature 2) and the response
variable (Target) by fitting a linear equation to the data.
Theory
The equation of the multiple linear regression line has the following form:
y = w0 + w1x1 + w2x2 + ... + wnxn
where w0 is the intercept and w1 through wn are the coefficients (weights) applied to the features x1 through xn.
Scikit-Learn/PySpark Similarities
Both Scikit-Learn and PySpark follow the same modeling steps: data
preparation, training and evaluating the model, and predicting the target
variable using new, unseen data.
The following are the key machine learning functions in Scikit-Learn and
their equivalents in PySpark.
1. Data preparation: in Scikit-Learn, the data is split into training and
testing sets with train_test_split(); in PySpark, the randomSplit()
method is used, and the feature columns are combined into a single
vector column with VectorAssembler.
2. Model training: in Scikit-Learn, the fit() method takes the training
features (X_train) and the target (y_train) as input; in PySpark, fit()
is called on a training DataFrame that contains the label column and
the assembled feature vector column.
3. Model evaluation: Scikit-Learn provides metric functions such as
mean_squared_error() and r2_score(), while PySpark uses the
RegressionEvaluator, which operates on the label and prediction
columns.
4. Prediction: in Scikit-Learn, the predict() method generates
predictions for new data, where you provide the features (X_test);
in PySpark, the transform() method is used, and the same assembler
must be applied before making predictions to ensure the new data has
the same feature vector structure.
Let’s now write some code to convert these descriptive steps into practice
beginning with Scikit-Learn.
[In]: X, y = make_regression(n_samples=100, n_features=2, noise=..., random_state=42)
[In]: X_train, X_test, y_train, y_test = train_test_split(X, y,
          test_size=0.2, random_state=42)
The code imports specific modules from the sklearn library as they are
needed for implementing the machine learning algorithm: make_regression
for generating the data, train_test_split for splitting the data into training
and testing sets, LinearRegression for building the model, and
mean_squared_error and r2_score for evaluating and testing models.
It then splits the data into training and testing sets, with 80% of the data
used for training (X_train and y_train) and 20% for testing (X_test and y_test).
The code creates and trains a linear regression model using the training
data. It first creates an instance of the LinearRegression model and then
uses the fit() method to train the model using the training data.
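A sketch of that training step (the print() call is added here only to surface the intercept and coefficients discussed next):
[In]: model = LinearRegression()
[In]: model.fit(X_train, y_train)
[In]: print(model.coef_, model.intercept_)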
The Intercept represents the bias term of the linear regression model. The
intercept is an additional constant term that is added to the weighted sum of
the features. It represents the expected or average value of the target
variable when all the features are zero or have no impact. In this case, the
intercept is 0.066, meaning that when all the features have zero values, the
model predicts an output value of approximately 0.066.
Step 5: Prediction
The trained model is used to make predictions on the test data. It predicts
the output (y) based on the input (X_test) using the predict() method.
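In code, this step presumably looks like the following:
[In]: y_pred = model.predict(X_test)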
We can compare the actual and predicted values side by side by printing the
top five rows as follows:
[In]: print("Actual\tPredicted")
[Out]:
Actual    Predicted
-49.72    -56.82
  4.15     -0.48
 82.05     84.59
-34.16    -20.40
 69.78     66.99
We can see that there are differences between the actual and predicted
values. For example, the first observation indicates that the model’s
predictions are lower than the actual values since -56.82 is less than -49.72.
A sample of five data points, however, does not provide a comprehensive
evaluation of the model’s overall performance. To assess the model’s
quality, we need to calculate evaluation metrics such as mean squared error
(MSE) and R-squared.
The code evaluates the performance of the model using mean squared error
(MSE) and R-squared. It calculates the MSE between the predicted values
(y_pred) and the true values (y_test) as well as the R-squared value, which
measures how well the linear regression model fits the data.
We can print the results using the built-in print() function as follows:
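A sketch of the evaluation and print statements described here:
[In]: mse = mean_squared_error(y_test, y_pred)
[In]: r2 = r2_score(y_test, y_pred)
[In]: print("Mean Squared Error:", mse)
[In]: print("R-squared:", r2)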
The MSE measures the average squared difference between the predicted
values and the actual values of the target variable. It quantifies the overall
accuracy of the model’s predictions, with lower values indicating better
performance. In our case, an MSE of 154.63 means that, on average, the
squared difference between the predicted and actual values is 154.63.
While the obtained MSE of 154.63 and R-squared of 0.98 provide initial
insights into the model’s performance, it’s important to note that these
values are based on randomly generated data. Real-world scenarios often
involve more complex and diverse datasets.
For the sake of illustration, let’s assume that we are satisfied with the
performance of this model. What should we do next? The next step would
be to utilize the trained model for making predictions on new, unseen data.
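A sketch of that step, using the three sets of input features described below:
[In]: new_data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
[In]: new_predictions = model.predict(new_data)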
[In]: print(new_predictions)
The new_data variable represents a list of input feature values for which we
want to obtain predictions. Each inner list within new_data corresponds to a
different set of input features. In our example, there are three sets of input
features: [1.0, 2.0], [3.0, 4.0], and [5.0, 6.0].
By calling the predict() method, we pass the new_data to the trained model
to obtain predictions for each set of input features. The resulting
predictions, [233.71, 553.67, 873.63], are stored in the new_predictions
variable and then printed to the screen using the print() function.
Now, let’s demonstrate how to convert the data preparation, model training,
model evaluation, and prediction steps into PySpark code to leverage
distributed computing.
[In]: spark = SparkSession.builder.appName("PySparkRegressionData").getOrCreate()
[In]: X, y = make_regression(n_samples=100, n_features=2, noise=..., random_state=42)
[In]: data = [(float(X[i, 0]), float(X[i, 1]), float(y[i])) for i in
range(len(X))]
[In]: data = spark.createDataFrame(data, ["Feature 1", "Feature 2", "Target"])
[In]: evaluator = RegressionEvaluator(labelCol="Target",
          predictionCol="prediction", metricName="mse")
[In]: mse = evaluator.evaluate(predictions)
[In]: r2 = model.summary.r2
The following are the imported classes and functions and their role in the
modeling process: SparkSession creates the Spark application,
make_regression generates the simulated data, VectorAssembler combines
the feature columns into a single vector, LinearRegression builds and trains
the model, and RegressionEvaluator measures its performance on the data
held out for testing.
This step has been explained in detail in the “The Dataset” section. The
code uses the make_regression function to generate simulated regression
data. It creates a two-dimensional feature matrix X and a target variable y.
The code converts the NumPy arrays X and y into a Spark DataFrame data
by creating a list of tuples and specifying the column names as [Feature 1,
Feature 2, Target].
The code randomly splits the data DataFrame into training and testing
DataFrames with an 80:20 ratio. The randomSplit method is used, and the
seed is set to 42 for reproducibility.
The first line of the code in this step creates a VectorAssembler object,
specifying the input columns [Feature 1, Feature 2] and the output column
name as features. The next two lines transform the training and testing
DataFrames by applying the assembler to combine the input features into a
single vector column named features.
In contrast, Scikit-Learn does not require a separate step for combining
features because it accepts input data in a different format. Scikit-Learn’s
machine learning algorithms typically work with NumPy arrays or Pandas
DataFrames, where each feature is represented by a separate column.
Let’s print the top five rows of the train_data to see how the data looks:
[In]: train_data.show(5)
[Out]:
Feature 1    Feature 2    Target     features
-1.92        -0.03        -177.39    [-1.92,-0.03]
-1.19         0.66         -49.27    [-1.19,0.66]
-0.89        -0.82        -132.59    [-0.89,-0.82]
-0.45         0.86          27.34    [-0.45,0.86]
 0.06        -1.14         -85.16    [0.06,-1.14]
We can see from the features column that the values of Feature 1 and Feature 2 (e.g., -1.92 and -0.03, respectively) have been combined into the numerical vector [-1.92,-0.03].
The code in this step creates a LinearRegression object, specifying the input
feature column as features and the label column as Target. It fits the linear
regression model on the training data (train_data) using the fit() method.
The output represents the coefficients and intercept of the trained linear
regression model. The coefficients indicate the weights assigned to each
feature in the regression model, which, in turn, indicate the impact or
importance of each feature on the target variable. The first coefficient of
86.03 corresponds to Feature 1, and the second coefficient of 73.24
corresponds to Feature 2.
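The learned parameters can be inspected directly on the fitted model; a minimal sketch:
[In]: print("Coefficients:", model.coefficients)
[In]: print("Intercept:", model.intercept)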
These coefficients and the intercept are the learned parameters of the linear
regression model. They represent the relationship between the features and
the target variable, allowing the model to make predictions based on new
input data. This prediction step is explained next.
The code in this step applies the trained model to the test data (test_data) to
make predictions. The transform method adds a prediction column to the
DataFrame predictions with the predicted values.
We can print the top five rows of the actual and predicted values as follows:
[In]: actual_vs_pred = predictions.select("Target", "prediction")
[In]: actual_vs_pred.show(5)
[Out]:
+-------+----------+
| Target|prediction|
+-------+----------+
|-192.19|   -183.11|
|  29.20|     35.05|
| -14.49|     -9.28|
| -62.52|    -57.03|
| -51.66|    -37.62|
+-------+----------+
We can observe from the preceding output that there are differences
between the actual target and predicted values, most of which are quite
significant. To get a better understanding of the model’s performance,
however, we need to look at metrics such as the MSE and r2, which utilize
the entire sample.
The output consists of two evaluation metrics for the linear regression model:
1. The MSE is a measure of the average squared difference between the predicted values and the actual values.
2. The R-squared value measures how much of the variability in the target is explained by the model; a value close to 1 suggests that the model provides a good fit to the data.
Assuming that we are satisfied with the model, we can now use it to predict
the target variable using new data:
"Feature 2"])
[In]: new_predictions.show()
[Out]:
+---------+---------+--------+----------+
|Feature 1|Feature 2|features|prediction|
+---------+---------+--------+----------+
|      1.0|      2.0|   [1,2]|    232.18|
|      3.0|      4.0|   [3,4]|    550.72|
|      5.0|      6.0|   [5,6]|    869.26|
+---------+---------+--------+----------+
The steps of applying the model to new data are as follows:
1. A new DataFrame is created from the new feature values, with the columns Feature 1 and Feature 2.
2. The VectorAssembler combines these columns into a single column called features. This step aligns the new data with the format the model was trained on.
3. The trained model's transform method is applied to the assembled data to generate predictions.
4. The resulting predictions are displayed with the show() method.
Summary
CHAPTER 4
Decision Tree Regression with Pandas, Scikit-Learn, and PySpark
We will be using an open source dataset that we have harvested from the
Kaggle website. As we have done in the previous chapter, we will start by
reading the dataset and then delve into it to understand more about the
features and the target variable.
The dataset we are using to train and evaluate the decision tree regression
model in Scikit-Learn and PySpark is widely known as the housing dataset.
This contains 545
records and 12 features, with each record representing a house and the
target variable indicating its price. The features represent the attributes of a
house, including its total area, the number of bedrooms, bathrooms, stories,
and parking spaces, as well as whether the house is connected to the main
road, has a guest room, a basement, a hot water heating system, and an air-
conditioning system, or is located in a preferred area, and whether it is fully
furnished, semi-furnished, or unfurnished.
Site: Kaggle
URL: www.kaggle.com/datasets/yasserh/housing-prices-dataset
Contributor: Yasser H.
Date: 2021
[In]: import pandas as pd
[In]: def load_housing_data():
          url = ('https://raw.githubusercontent.com/abdelaziztestas/'
                 'spark_book/main/housing.csv')
          return pd.read_csv(url)
[In]: pandas_df = load_housing_data()
[In]: pandas_df.head()
[Out]:
      price  area  bedrooms  bathrooms  stories mainroad guestroom
0  13300000  7420         4          2        3      yes        no
1  12250000  8960         4          4        4      yes        no
2  12250000  9960         3          2        2      yes        no
3  12215000  7500         4          2        2      yes        no
4  11410000  7420         4          1        2      yes       yes

  basement hotwaterheating airconditioning  parking prefarea furnishingstatus
0       no              no             yes        2      yes        furnished
1       no              no             yes        3       no        furnished
2      yes              no              no        2      yes   semi-furnished
3      yes              no             yes        3      yes        furnished
4      yes              no             yes        2       no        furnished
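Before calling show() on the PySpark side, the Pandas DataFrame has to be converted into a Spark DataFrame. A minimal sketch of that conversion (the application name is an arbitrary choice):
[In]: from pyspark.sql import SparkSession
[In]: spark = SparkSession.builder.appName("HousingData").getOrCreate()
[In]: spark_df = spark.createDataFrame(pandas_df)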
[In]: spark_df.show(5)
[Out]:
+--------+----+--------+---------+-------+--------+---------+
|   price|area|bedrooms|bathrooms|stories|mainroad|guestroom|
+--------+----+--------+---------+-------+--------+---------+
|13300000|7420|       4|        2|      3|     yes|       no|
|12250000|8960|       4|        4|      4|     yes|       no|
|12250000|9960|       3|        2|      2|     yes|       no|
|12215000|7500|       4|        2|      2|     yes|       no|
|11410000|7420|       4|        1|      2|     yes|      yes|
+--------+----+--------+---------+-------+--------+---------+

+--------+---------------+---------------+-------+--------+----------------+
|basement|hotwaterheating|airconditioning|parking|prefarea|furnishingstatus|
+--------+---------------+---------------+-------+--------+----------------+
|      no|             no|            yes|      2|     yes|       furnished|
|      no|             no|            yes|      3|      no|       furnished|
|     yes|             no|             no|      2|     yes|  semi-furnished|
|     yes|             no|            yes|      3|     yes|       furnished|
|     yes|             no|            yes|      2|      no|       furnished|
+--------+---------------+---------------+-------+--------+----------------+
The output from both Pandas and PySpark is identical except that PySpark
doesn’t show an index. We can also see from the output that each of the
datasets contains 13 columns or attributes, which represent different aspects
of houses. These attributes include the sale price and the total area in square
feet that the house covers.
The price in the dataset is the target variable, while the other features are
the predictors. The goal is to make predictions by following the decision
tree structure that will be learned during the training phase.
[In]: print(pandas_df.shape)
The output from both Pandas and PySpark confirms that we have 545 rows
and 13
columns. In Pandas, we can count the number of null values in each column
using the isnull() and sum() methods:
[In]: pandas_null_counts = pandas_df.isnull().sum()
[In]: print(pandas_null_counts)
[Out]:
price 0
area 0
bedrooms 0
bathrooms 0
stories 0
mainroad 0
guestroom 0
basement 0
hotwaterheating 0
airconditioning 0
parking 0
prefarea 0
furnishingstatus 0
dtype: int64
In PySpark, we can use the isNull() and sum() methods but in a slightly
different way:
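One common way of expressing this, shown here as a sketch (the variable name spark_null_counts is an assumption):
[In]: from pyspark.sql.functions import col, sum as spark_sum
[In]: spark_null_counts = spark_df.select(
          [spark_sum(col(c).isNull().cast("int")).alias(c) for c in spark_df.columns])
[In]: spark_null_counts.show()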
[Out]:
+-----+----+--------+---------+-------+--------+---------+
|price|area|bedrooms|bathrooms|stories|mainroad|guestroom|
+-----+----+--------+---------+-------+--------+---------+
|    0|   0|       0|        0|      0|       0|        0|
+-----+----+--------+---------+-------+--------+---------+

+--------+---------------+---------------+-------+--------+----------------+
|basement|hotwaterheating|airconditioning|parking|prefarea|furnishingstatus|
+--------+---------------+---------------+-------+--------+----------------+
|       0|              0|              0|      0|       0|               0|
+--------+---------------+---------------+-------+--------+----------------+
Since the PySpark output is horizontal, unlike that of Pandas that fits
vertically within the page, we have split it into two tables to enhance
readability.
We can see from both Pandas and PySpark output that there are 0 null
values. This is good news as machine learning algorithms don’t usually
work well with missing data.
Next, we can use the describe() method in Pandas and the summary()
method in PySpark to produce key summary statistics for the numerical
variables.
[In]: pandas_df.describe()
[Out]:
        price     area     bedrooms  bathrooms  stories  parking
count   545       545      545       545        545      545
mean    4766729   5150.54  2.97      1.29       1.81     0.69
std     1870440   2170.14  0.74      0.50       0.87     0.86
min     1750000   1650     1         1          1        0
25%     3430000   3600     2         1          1        0
50%     4340000   4600     3         1          2        0
75%     5740000   6360     3         2          2        1
max     13300000  16200    6         4          4        3
[Out]:
+-------+--------+-------+--------+---------+-------+-------+
|summary|   price|   area|bedrooms|bathrooms|stories|parking|
+-------+--------+-------+--------+---------+-------+-------+
|  count|     545|    545|     545|      545|    545|    545|
|   mean| 4766729|5150.54|    2.97|     1.29|   1.81|   0.69|
| stddev| 1870440|2170.14|    0.74|     0.50|   0.87|   0.86|
|    min| 1750000|   1650|       1|        1|      1|      0|
|    25%| 3430000|   3600|       2|        1|      1|      0|
|    50%| 4340000|   4600|       3|        1|      2|      0|
|    75%| 5740000|   6360|       3|        2|      2|      1|
|    max|13300000|  16200|       6|        4|      4|      3|
+-------+--------+-------+--------+---------+-------+-------+
Finally, we can check the categorical variables in the housing dataset using
a for loop. To get the Pandas output, we loop over each column of the
categorical variables using a list of column names. For each column, we
perform the following operations:
• We extract the column as a Pandas Series and retrieve its unique values.
• We then print the column name along with its unique values (see the sketch below).
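A sketch of the loop, following the same pattern used later in the book for this dataset:
[In]: cat_cols = ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
                  'airconditioning', 'prefarea', 'furnishingstatus']
[In]: for col_name in cat_cols:
          unique_values = pandas_df[col_name].unique()
          print(f"Unique values in {col_name}: {unique_values}")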
[Out]:
Unique values in furnishingstatus: ['furnished' 'semi-furnished' 'unfurnished']
• The resulting column name is associated with its unique values in the printed output.
To produce the equivalent output in PySpark, we loop over the same columns:
[In]: from pyspark.sql.functions import col
[In]: for col_name in ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
                       'airconditioning', 'prefarea', 'furnishingstatus']:
          unique_values = spark_df.select(col(col_name)).distinct()\
              .rdd.flatMap(lambda x: x).collect()
          print(f"Unique values in {col_name}: {unique_values}")
[Out]:
Unique values in furnishingstatus: ['unfurnished', 'semi-furnished', 'furnished']
The output of both Pandas and PySpark shows that 7 out of the 12 features
are categorical variables. This suggests that we need to perform one-hot
encoding on these variables to convert them into numerical values before
feeding them to the algorithm.
In this section, we build, train, and evaluate a decision tree regressor with
default hyperparameters and use it to predict house sale prices. Similar to
multiple linear regression in the previous chapter, this model is an example
of supervised learning. This is because we will be providing the target variable along with the features.
However, decision tree regressors come with their own set of limitations. Their vulnerability to overfitting is a well-known issue, especially if the hyperparameters are not set up properly. For example, if a tree is allowed to grow too deep, it extends far into the data and captures noise, which leads to inadequate performance on new, unseen data.
unseen data. Another aspect to consider is the inherent susceptibility to
minor data fluctuations, which can trigger significant alterations in the
tree’s structure, thereby diminishing its resilience and dependability. As a
result, decision trees can be inclined to generate considerably distinct trees
when trained on different subsets of the training data, making the model
susceptible to noise in the data. Since overfitting is an important topic in
data science, it will be covered in detail in this section.
Scikit-Learn/PySpark Similarities
1. Data preparation: Both frameworks one-hot encode the categorical variables and split the data into training and testing sets.
2. Model training: Both create a decision tree regressor and fit it on the training data.
3. Model evaluation: Both assess the model with metrics like mean squared error and r2. Specify the predicted values column (predictions) and the actual values column (y_test) for evaluation.
4. Prediction: The trained model is applied to the test set (and later to new data) to obtain predicted prices.
As can be seen from this comparison, the modeling steps in both Scikit-
Learn and PySpark are similar. The implementation details and syntax may
differ, however, as Scikit-Learn is a Python-based library while PySpark is
a distributed computing framework based on Apache Spark.
Let’s now proceed with implementing these steps using decision tree
regression in both Scikit-Learn and PySpark, beginning with Scikit-Learn.
[In]: def load_housing_data():
          url = ('https://raw.githubusercontent.com/abdelaziztestas/'
                 'spark_book/main/housing.csv')
          return pd.read_csv(url)
[In]: pandas_df = load_housing_data()
[In]: X = pandas_df.drop('price', axis=1)
[In]: y = pandas_df['price']
We can now proceed with the modeling steps: importing necessary libraries,
preparing data, training the model, making predictions on test data, and
evaluating the model.
The first step in the modeling process is importing the necessary libraries.
This step allows us to access a wide range of pre-built functions, classes,
and tools that can significantly simplify and speed up our development
process compared to building, training, and evaluating the model by writing
code from scratch.
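For a decision tree regressor in Scikit-Learn, this import block might look like the following sketch:
[In]: import pandas as pd
[In]: from sklearn.tree import DecisionTreeRegressor
[In]: from sklearn.model_selection import train_test_split
[In]: from sklearn.metrics import r2_score, mean_squared_error
[In]: from sklearn.preprocessing import OneHotEncoder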
In this step, we prepare data in a format that is suitable for the machine
learning algorithm. It includes one-hot encoding categorical variables and
splitting data into training and testing sets.
[In]: cat_cols = ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
                  'airconditioning', 'prefarea', 'furnishingstatus']
[In]: encoder = OneHotEncoder(sparse=False)
[In]: X_encoded = pd.DataFrame(encoder.fit_transform(X[cat_cols]),
          columns=encoder.get_feature_names_out(cat_cols))
[In]: X.drop(cat_cols, axis=1, inplace=True)
[In]: X = pd.concat([X, X_encoded], axis=1)
[In]: X_train, X_test, y_train, y_test = train_test_split(X, y,
          test_size=0.2, random_state=42)
Step 3: Model training
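A minimal sketch of this step (the variable name dt_regressor and the random_state value are assumptions; the chapter otherwise keeps the default hyperparameters):
[In]: dt_regressor = DecisionTreeRegressor(random_state=42)
[In]: dt_regressor.fit(X_train, y_train)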
Step 4: Prediction
In this step, we make predictions on the test set using the model we have
built and trained in the previous step.
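Sketched with the same assumed names:
[In]: y_pred = dt_regressor.predict(X_test)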
This is our final step in the modeling process where we evaluate the
model’s performance using previously unseen (test) data. We calculate two
key performance statistics: the root mean squared error (RMSE) and the
coefficient of determination, R-squared.
The modules imported earlier map directly onto these steps: train_test_split divides the data into training and testing sets, r2_score and mean_squared_error compute the evaluation metrics, and the RMSE is obtained by taking the square root of the MSE.
For the purpose of this project, data preparation in Scikit-Learn has two
substeps. In the first substep, the categorical columns specified in cat_cols
(mainroad, guestroom, basement, hot water heating, air conditioning,
preferred area, and furnishing status) are one-hot encoded using
OneHotEncoder. We start by initializing this class with the sparse parameter set to False. This means that the resulting encoded matrix is returned as a regular (dense) array containing all the 1s and 0s for the encoded categories, rather than as a sparse matrix that stores only the nonzero entries, which is what we would get if the parameter were left at True.
Once the encoded columns have been added, the original categorical columns are redundant. Thus, we utilize the drop method to remove them (the specified axis=1 means we are dropping columns, not rows, while the parameter inplace=True indicates that the operation should modify the original DataFrame X directly, without creating a new DataFrame).
& (X["bedrooms"] == 3)
& (X["bathrooms"] == 2)
& (X["stories"] == 4)
& (X["parking"] == 1)
.filter([
"mainroad",
"basement",
"hotwaterheating",
"airconditioning",
"prefarea",
"furnishingstatus"]))
[Out]:
  mainroad basement hotwaterheating airconditioning prefarea furnishingstatus
       yes       no              no             yes       no        furnished
The initial section of the code uses the AND (&) operator to combine conditions on the numerical columns (bedrooms, bathrooms, stories, and parking) so that a single record is selected. Afterward, the following categorical columns are selected using the filter method in the latter part of the code:
• mainroad
• basement
• hotwaterheating
• airconditioning
• prefarea
• furnishingstatus
After one-hot encoding, the same record looks as follows:

mainroad_no  mainroad_yes  basement_no  basement_yes
          0             1            1             0

hotwaterheating_no  hotwaterheating_yes  airconditioning_no  airconditioning_yes
                 1                    0                   0                    1

prefarea_no  prefarea_yes  furnishingstatus_furnished  furnishingstatus_semi-furnished  furnishingstatus_unfurnished
          1             0                           1                                0                             0
Since the original mainroad value was yes for the specific row we have
selected, the mainroad_yes column for that row is set to 1, and the
mainroad_no column is set to 0 after one-hot encoding is applied. This is
the same scenario with air conditioning.
Since the original value for the selected record was yes, the new columns
are set to 1 for airconditioning_yes and 0 for airconditioning_no.
The furnishing status might look a bit more complicated, since it has three values instead of a binary yes or no, but the one-hot encoding process works in exactly the same way. Since the original value for the furnishing status was furnished, the encoded furnishingstatus_furnished column is set to 1, while the furnishingstatus_semi-furnished and furnishingstatus_unfurnished columns are both set to 0.
The model is trained on the training data using the fit() method.
After the model is trained, we can print the importance of each of its
features as follows:
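A sketch of how these importances can be printed (dt_regressor is the assumed name of the fitted model):
[In]: importances = dict(zip(X_train.columns, dt_regressor.feature_importances_))
[In]: for name, importance in sorted(importances.items(), key=lambda kv: kv[1],
          reverse=True):
          print(f"{name} : {importance:.4f}")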
[Out]:
area : 0.4976
bathrooms : 0.1759
stories : 0.0512
parking : 0.0412
bedrooms : 0.0393
furnishingstatus_unfurnished : 0.0318
airconditioning_no : 0.0242
prefarea_no : 0.0233
airconditioning_yes : 0.0206
hotwaterheating_no : 0.0191
basement_no : 0.0158
guestroom_yes : 0.0134
guestroom_no : 0.0123
prefarea_yes : 0.0099
mainroad_yes : 0.0063
furnishingstatus_semi-furnished : 0.0046
mainroad_no : 0.0023
furnishingstatus_furnished : 0.0017
hotwaterheating_yes : 0.0007
Step 4: Prediction
We can now compare the top five rows of the actual and predicted target
values to have some idea about the model’s predictions. We can achieve this
with the following code:
[In]: results_df = pd.DataFrame({'Price': y_test.values, 'Prediction': y_pred})
[In]: print(results_df.head())
[Out]:
     Price  Prediction
   4060000     5600000
   6650000     7840000
   3710000     3850000
   6440000     4935000
   2800000     2660000
In this step, evaluation metrics are calculated using the predicted values
y_pred and the actual values y_test. The R-squared score and root mean
squared error (RMSE) are calculated. They can be printed with the
following code:
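A minimal sketch of this calculation:
[In]: import numpy as np
[In]: r2 = r2_score(y_test, y_pred)
[In]: rmse = np.sqrt(mean_squared_error(y_test, y_pred))
[In]: print(f"R-squared: {r2:.2f}")
[In]: print(f"RMSE: {rmse:,.0f}")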
An acceptable percentage of the root mean squared error is one that is less
than 10%
of the mean of the actual target variable. In the specific example, the mean
price across all houses in the sample is approximately 4,766,729:
[In]: y.mean()
[Out]: 4766729.25
With an RMSE of roughly 1,715,691, the error exceeds 35% of that value, implying that the model's performance is not as high as one would have hoped.
So what explains the model’s modest performance? Well, decision trees are
known to overfit because a standard model is typically trained on a single
tree, and we cannot directly increase the number of trees as we would in
ensemble methods like random forest or gradient boosting. There is also the
curse of dimensionality: the current regression model originally had 12
features, but after applying one-hot encoding, it added 8 new features,
resulting in a total of 20 columns. This increase in dimensionality could
lead to unnecessary complexity, a concern that is amplified by the fact that
our sample size is relatively small, comprising only 545 records.
Overfitting
One way to detect overfitting is the train/test split. We divide the data into
two separate sets: a training set and a test set. We train the model on the
training set and evaluate its performance on the test set. If the model
performs significantly better on the training set compared to the test set, it
may be overfitting.
We have already done most of this. We only need to calculate the train r2
and compare it with that of the test subsample. We can print the r2 of the
training sample as follows:
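A sketch of that comparison (dt_regressor is the assumed model name):
[In]: train_r2 = r2_score(y_train, dt_regressor.predict(X_train))
[In]: print(f"Train R-squared: {train_r2:.4f}")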
The r2 on the train set is over 0.99, meaning that the model’s predictions
can explain almost 100% of the variability in the target variable. This is
massively higher than the 42% achieved on the test set, which is a clear indication of overfitting.
We can plot the learning curve for the current DecisionTreeRegressor with
the following code:
Step 1: Import the necessary libraries
[In]: import numpy as np
[In]: import matplotlib.pyplot as plt
[In]: from sklearn.model_selection import learning_curve
Step 2: Define a custom plotting function
[In]: def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                              n_jobs=None, train_sizes=np.linspace(0.1, 1.0, 5)):
          plt.figure()
          plt.title(title)
          if ylim is not None:
              plt.ylim(*ylim)
          plt.xlabel("Training examples")
          plt.ylabel("Score")
          train_sizes, train_scores, test_scores = learning_curve(
              estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
          train_scores_mean = np.mean(train_scores, axis=1)
          train_scores_std = np.std(train_scores, axis=1)
          test_scores_mean = np.mean(test_scores, axis=1)
          test_scores_std = np.std(test_scores, axis=1)
          plt.grid()
          plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                           train_scores_mean + train_scores_std, alpha=0.1,
                           color="r")
          plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                           test_scores_mean + test_scores_std, alpha=0.1,
                           color="g")
          plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
                   label="Training score")
          plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
                   label="Cross-validation score")
          plt.legend(loc="best")
          return plt
Step 3: Plot the learning curve
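A sketch of the call that produces the plot (the title and cv value are assumptions):
[In]: plot_learning_curve(DecisionTreeRegressor(random_state=42),
          "Learning Curve (DecisionTreeRegressor)", X, y, cv=5, n_jobs=-1)
[In]: plt.show()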
[Out]:
The code demonstrates the creation of a learning curve plot to assess the
performance of the decision tree regressor on the housing dataset. The goal
is to observe how the model’s performance changes as the size of the
training dataset increases.
The code begins by importing the necessary libraries, including NumPy for
numerical operations, Matplotlib for visualization, and the learning_curve
module for generating learning curves. A custom function named
plot_learning_curve is then defined. Its train_sizes parameter controls which fractions of the dataset are used to generate the curve.
For this, we have set the default (0.1, 1.0, 5), meaning the starting point of
the range is 0.1, or 10% of the dataset size, the ending point of the range is
1.0 representing 100%
of the dataset size, and 5 representing the number of evenly spaced values
within the specified range.
Inside the function, a Matplotlib figure is created, and the title is set. If y-
axis limits are provided, they are applied to the plot. The x axis represents
the number of training examples used for model training. The x-axis values
are generated based on the train_sizes parameter passed to the function.
The plot provides insights into how the decision tree regression model’s
performance changes as the size of the training set increases. The training
score line and points represent the average R2 score obtained on the
training data across the different training set sizes. The line’s trajectory
indicates how well the model’s predictions fit the training data as the
training set size increases. The points show the R2 score for each individual
training set size.
The cross-validation score line and points represent the average R2 score
obtained through cross-validation across the different training set sizes.
Cross-validation involves training the model on different subsets of the data
and testing it on the remaining data.
The line’s trajectory shows how well the model generalizes to new, unseen
data as the training set size increases.
The shaded area around the cross-validation score line represents the
standard deviation of the R2 score. The area provides a sense of the
variability in the scores obtained during the cross-validation process. Wider
shaded areas indicate higher variability in performance.
So now that we know that the model overfits, what are the options? There
are several strategies to address overfitting:
• More training data: Our dataset is relatively small with only 545
and one might perform better for our specific data. We will explore
this option in the next two chapters as we delve into regression with
model is too complex. This is likely the case here as our model has
99
number of features, such as the area of the house and the number
• Z-score: This involves calculating the z-score for each data point
on the difference between the upper quartile (Q3) and the lower
quartile (Q1) of the data. Data points outside the range Q1 - 1.5 *
PySpark
[In]: import pandas as pd
[In]: from pyspark.sql import SparkSession
[In]: spark = SparkSession.builder.appName("DecisionTreeRegressorExample").getOrCreate()
[In]: def load_housing_data():
          url = ('https://raw.githubusercontent.com/abdelaziztestas/'
                 'spark_book/main/housing.csv')
          return pd.read_csv(url)
[In]: pandas_df = load_housing_data()
[In]: spark_df = spark.createDataFrame(pandas_df)
The modeling steps in PySpark are largely similar to the steps in Scikit-
Learn: importing necessary libraries, data preparation, model training,
prediction, and model evaluation. Two additional substeps in data
preparation, however, are needed: using StringIndexer before applying
OneHotEncoder and combining features into a vector using
VectorAssembler.
2.1. Label-encoding the categorical columns
[In]: cat_columns = ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
                     'airconditioning', 'prefarea', 'furnishingstatus']
[In]: for c in cat_columns:
          indexer = StringIndexer(inputCol=c, outputCol=c + '_label',
                                  handleInvalid='keep')
          spark_df = indexer.fit(spark_df).transform(spark_df)
2.2. One-hot encoding the labeled columns
[In]: encoder = OneHotEncoder(inputCols=[c + '_label' for c in cat_columns],
                              outputCols=[c + '_encoded' for c in cat_columns])
[In]: spark_df = encoder.fit(spark_df).transform(spark_df)
2.3. Combining all features into a single vector
[In]: feature_cols = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking'] + \
                     [c + '_encoded' for c in cat_columns]
[In]: assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
[In]: spark_df = assembler.transform(spark_df)
2.4. Splitting the data
[In]: train_data, test_data = spark_df.randomSplit([0.8, 0.2], seed=42)
3. Model training
[In]: dt = DecisionTreeRegressor(featuresCol='features', labelCol='price')
[In]: spark_model = dt.fit(train_data)
4. Prediction
[In]: predictions = spark_model.transform(test_data)
5. Model evaluation
[In]: evaluator_rmse = RegressionEvaluator(labelCol='price',
          predictionCol='prediction', metricName='rmse')
[In]: evaluator_r2 = RegressionEvaluator(labelCol='price',
          predictionCol='prediction', metricName='r2')
[In]: rmse = evaluator_rmse.evaluate(predictions)
[In]: r2 = evaluator_r2.evaluate(predictions)
Let's go through the code step by step to mimic what we did with Scikit-Learn:
Step 1: Importing necessary libraries
• Data preparation (StringIndexer, OneHotEncoder, VectorAssembler)
• Regression (DecisionTreeRegressor)
• Evaluation (RegressionEvaluator)
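The corresponding import statements might look like this sketch:
[In]: from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
[In]: from pyspark.ml.regression import DecisionTreeRegressor
[In]: from pyspark.ml.evaluation import RegressionEvaluator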
This step involves preparing the data for training and testing the model. It
has four substeps:
The categorical columns in the dataset are specified in the cat_columns list.
For each categorical column, a StringIndexer is created, which converts
categorical values into numerical indices. The handleInvalid parameter is
set to keep, meaning that if a category unseen during training appears in the
test set, it won’t throw an error but will be handled appropriately.
The following is how the indexed data looks like using the mainroad
categorical column as an example:
[In]: spark_df.select("mainroad", "mainroad_label").distinct().show()
[Out]:
+--------+--------------+
|mainroad|mainroad_label|
+--------+--------------+
| yes| 0.0|
| no| 1.0|
+--------+--------------+
It takes the labeled columns from the previous step and transforms them
into one-hot encoded vectors. Each original categorical column is now
represented by a set of binary columns indicating the presence of a specific
category.
[In]: spark_df.select("mainroad", "mainroad_label",
          "mainroad_encoded").distinct().show()
[Out]:
+--------+--------------+----------------+
|mainroad|mainroad_label|mainroad_encoded|
+--------+--------------+----------------+
|     yes|           0.0|   (2,[0],[1.0])|
|      no|           1.0|   (2,[1],[1.0])|
+--------+--------------+----------------+
Let's pull one single row of the features column, which contains the assembled features, to see how the assembled vector looks:
[In]: print(spark_df.select('features').first()[0])
[Out]:
(20,[0,1,2,3,4,5,7,9,11,14,16,19],
[7420.0,4.0,2.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
The number 20, before the first bracket, refers to the total length of the feature vector. It is the sum of the five original numerical features (area, bedrooms, bathrooms, stories, and parking) and the 15 slots produced by one-hot encoding the seven categorical columns: two slots for each of the six binary columns (mainroad, guestroom, basement, hot water heating, air conditioning, and preferred area) and three slots for the furnishing status column with its three unique values (furnished, semi-furnished, and unfurnished).
There are 12 indices and 12 corresponding values in the first and second brackets, respectively. These are the nonzero entries of the sparse vector: the five numerical values plus one active slot for each of the seven one-hot encoded categorical columns.
In this substep of the data preparation step, the data is split into training and
testing sets using the randomSplit method. Here, 80% of the data is
allocated for training and 20% for testing. A seed value of 42 is provided to
ensure reproducibility.
[In]: feature_importances = spark_model.featureImportances
[In]: feature_names = feature_cols
[In]: feature_importances_dict = {}
[In]: for i in range(len(feature_names)):
          feature_name = feature_names[i]
          importance = feature_importances[i]
          feature_importances_dict[feature_name] = importance
[In]: for name, importance in sorted(feature_importances_dict.items(),
          key=lambda kv: kv[1], reverse=True):
          print(f"{name}: {importance:.4f}")
The feature importances are then printed in descending order.
[Out]:
area: 0.5012
bathrooms: 0.1887
stories: 0.0595
parking: 0.0495
basement_encoded: 0.0241
furnishingstatus_encoded: 0.0084
airconditioning_encoded: 0.0079
bedrooms: 0.0044
mainroad_encoded: 0.0
guestroom_encoded: 0.0
prefarea_encoded: 0.0
We observe from the output that the OneHotEncoder has not created
multiple binary columns like in Scikit-Learn where each one-hot encoded
variable was expanded into two columns (_yes and _no). Instead, PySpark
has created sparse vectors for the encoded categorical columns. For
example, for the furnishing status, there is only one one-hot encoded column named furnishingstatus_encoded.
Step 4: Prediction
In this step of the PySpark modeling process, the trained model is used to
make predictions on the test data using the transform method. The resulting
predictions are added as a new column in the predictions DataFrame.
We can print the top five rows of the actual vs. predicted target values as
follows:
[In]: predictions.select('price', 'prediction').show(5)
[Out]:
+-------+-----------------+
| price| prediction|
+-------+-----------------+
|6930000| 4566692|
|7070000| 5197333|
|7210000| 6940863|
|7350000| 5197333|
|7455000| 2791250.0|
+-------+-----------------+
We can see significant differences between the actual target variable (price)
and the predicted target variable (prediction). For example, the first row
indicates that the model underestimates the true price by close to 35%
(4,566,692 vs. 6,930,000). This suggests that the model’s performance is
likely to be modest, just as it was in Scikit-Learn. We can confirm this by
looking at the RMSE and r2 statistics.
Two different evaluation metrics are computed for the model’s performance
on the test data. The RegressionEvaluator is used to calculate the root mean
squared error (RMSE) and the R-squared (r2) values, both of which provide
insights into how well the model’s predictions align with the actual target
values.
Based on our earlier investigation (comparing r2 between test and train sets
and the learning curve in Scikit-Learn), this is likely due to the model
overfitting because of the relatively large number of features and the
relatively small sample size.
[In]: train_predictions = spark_model.transform(train_data)
[In]: evaluator_train_r2 = RegressionEvaluator(labelCol='price',
          predictionCol='prediction', metricName='r2')
[In]: train_r2 = evaluator_train_r2.evaluate(train_predictions)
[In]: print(train_r2)
The code to calculate and print the r2 for the train set begins by using the trained spark_model to make predictions on the training data, resulting in the train_predictions DataFrame. Then, a RegressionEvaluator object is created, specifying that it should evaluate the R2 metric between the "price" column (actual target) and the "prediction" column, and the resulting score is printed.
Clearly, there is a substantial difference between the R2 score of the test set
(0.30) and the R2 score of the training set (0.72), which is indicative of
overfitting. This confirms what we have already known from the Scikit-
Learn results.
Bringing It All Together
In this section, we have combined all the modeling steps into one code block to make it easier for the data scientist to replicate the results in this chapter and to understand better how all the modeling steps fit together.
Scikit-Learn
url = ('https://raw.githubusercontent.com/abdelaziztestas/'
'spark_book/main/housing.csv')
return pd.read_csv(url)
[In]: y = pandas_df['price']
feature_names_out(cat_cols))
size=0.2, random_state=42)
# Train the model
[In]: print(results_df.head())
The following code shows how decision tree regression can be done in
PySpark using the same open source housing dataset used to train the
Scikit-Learn decision tree regressor:
VectorAssembler
[In]: spark =
SparkSession.builder.appName("DecisionTreeRegressorExample").
getOrCreate()
url = ('https://raw.githubusercontent.com/abdelaziztestas/'
'spark_book/main/housing.csv')
return pd.read_csv(url)
spark_df = indexer.fit(spark_df).transform(spark_df)
seed=42)
[In]: feature_importances_dict = {}
feature_name = feature_names[i]
importance = feature_importances[i]
feature_importances_dict[feature_name] = importance
predictionCol='prediction', metricName='rmse')
predictionCol='prediction', metricName='r2')
[In]: r2 = evaluator_r2.evaluate(predictions)
Summary
CHAPTER 5
Random Forest Regression with Pandas, Scikit-Learn, and PySpark
Despite both regression models utilizing decision trees, they exhibit notable
distinctions.
The objectives of this chapter are twofold. First, we will use Scikit-Learn
and PySpark to build, train, and evaluate a random forest regression model,
concurrently drawing parallels between the two frameworks. Subsequently,
we will assess the hypothesis that random forests outperform decision trees
by applying the random forest model to the same housing dataset used in the previous chapter.
Before we build the random forest regression model, let’s revisit the
housing dataset to refresh our understanding of its fundamental attributes.
Additionally, we can demonstrate that we can perform the same exploratory
data analysis with PySpark as we usually do with Pandas, all while
harnessing the power of PySpark’s distributed computing.
The Dataset
[In]: import pandas as pd
[In]: from pyspark.sql import SparkSession
[In]: spark = SparkSession.builder.appName("RandomForestRegressorExample").getOrCreate()
[In]: def load_housing_data():
          url = ('https://raw.githubusercontent.com/abdelaziztestas/'
                 'spark_book/main/housing.csv')
          return pd.read_csv(url)
[In]: pandas_df = load_housing_data()
[In]: spark_df = spark.createDataFrame(pandas_df)
Recall from the previous chapter that this dataset had 545 rows and 13
columns. We can confirm this by using the Pandas shape attribute:
[In]: print(pandas_df.shape)
We can list the name of the columns or labels using the columns attribute
for both Pandas and PySpark:
[In]: print(pandas_df.columns)
[Out]: Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'parking',
       'prefarea', 'furnishingstatus'],
      dtype='object')
PySpark:
[In]: print(spark_df.columns)
[Out]: ['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'parking', 'prefarea', 'furnishingstatus']
Remember that the column attribute doesn’t show an index in the PySpark
output as the two libraries handle indexing differently.
We can choose to turn off the display of the index in Pandas to align with PySpark with the following code:
[In]: column_labels = '[' + ', '.join(["'" + c + "'" for c in pandas_df.columns]) + ']'
[In]: print(column_labels)
The price variable is our label or target variable, while the remaining
variables are the features.
Pandas:
[In]: print(pandas_df.dtypes)
[Out]:
price int64
area int64
bedrooms int64
bathrooms int64
stories int64
mainroad object
guestroom object
basement object
hotwaterheating object
airconditioning object
parking int64
prefarea object
furnishingstatus object
dtype: object
PySpark:
[In]: for col_name, col_type in spark_df.dtypes:
          print(col_name, col_type)
[Out]:
price bigint
area bigint
bedrooms bigint
bathrooms bigint
stories bigint
mainroad string
guestroom string
basement string
hotwaterheating string
airconditioning string
parking bigint
prefarea string
furnishingstatus string
The numerical variables have the int64 data type in Pandas and bigint in
PySpark, while the categorical variables have the object and string data
types, respectively.
There are, therefore, seven categorical variables that will need to be one-hot
encoded (mainroad, guestroom, basement, hotwaterheating, airconditioning,
prefarea, furnishingstatus).
Pandas:
[In]: for col in ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
                  'airconditioning', 'prefarea', 'furnishingstatus']:
          unique_values = pandas_df[col].unique()
          print(f"Unique values in {col}: {unique_values}")
[Out]:
Unique values in furnishingstatus: ['furnished' 'semi-furnished' 'unfurnished']
PySpark:
[In]: from pyspark.sql.functions import col
[In]: for col_name in ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
                       'airconditioning', 'prefarea', 'furnishingstatus']:
          unique_values = spark_df.select(col(col_name)).distinct().rdd.\
              flatMap(lambda x: x).collect()
          print(f"Unique values in {col_name}: {unique_values}")
[Out]:
Unique values in furnishingstatus: ['unfurnished', 'semi-furnished', 'furnished']
Before building the random forest regression model, let’s first compare the
modeling steps in Scikit-Learn and PySpark.
Scikit-Learn/PySpark Similarities
1. Data preparation: In both frameworks, the categorical variables are one-hot encoded and the data is split into training and testing sets; PySpark additionally assembles the features into a single vector column, an approach Scikit-Learn does not require.
2. Model training: Both create a random forest regressor and fit it on the training data.
3. Model evaluation: Scikit-Learn: The assessment of the random forest regression model relies on metrics such as RMSE and r2 computed from the actual values (y_test) and the predicted values (y_pred). PySpark: The RegressionEvaluator computes the same metrics from the price and prediction columns.
4. Prediction: In Scikit-Learn, the trained model's predict() method returns the predicted values (y_pred); in PySpark, the transform() method adds a prediction column to a DataFrame whose features are held in a vector structure.
[In]: def load_housing_data():
          url = ('https://raw.githubusercontent.com/abdelaziztestas/'
                 'spark_book/main/housing.csv')
          return pd.read_csv(url)
[In]: pandas_df = load_housing_data()
[In]: X = pandas_df.drop('price', axis=1)
[In]: y = pandas_df['price']
In the next step, the code creates two new DataFrames: X containing the 12
input features and y holding the target variable (price). This separation
between X and y allows for supervised learning tasks where the goal is to
predict the target values based on the input features in X.
In this step, we import the classes and functions required for data
preparation, model training, model evaluation, and data manipulation:
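A sketch of this import block for the random forest case:
[In]: import pandas as pd
[In]: from sklearn.ensemble import RandomForestRegressor
[In]: from sklearn.model_selection import train_test_split
[In]: from sklearn.metrics import r2_score, mean_squared_error
[In]: from sklearn.preprocessing import OneHotEncoder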
There are two substeps involved here: encoding categorical variables into
numerical values and splitting the dataset into training and testing sets:
feature_names_out(cat_cols))
size=0.2, random_state=42)
The aforementioned five modeling steps indicate that in just a few lines of
code, we have prepared the data, built the random forest model, trained the
model, evaluated it, and made predictions on the testing set.
Let’s go through the steps in more detail:
In this step, the libraries for preparing data, building, training, and
evaluating the random forest regression model are imported. These modules
include
The dataset is divided into training and testing subsets using the train_test_
split() method. The parameter test_size specifies the proportion of data to
be allocated for testing (0.2 or 20%), while random_state ensures
reproducibility of the split. The resulting training and testing feature sets
(X_train and X_test) and corresponding target sets (y_train and y_test) are
created for subsequent model training and evaluation.
[In]: print(random_forest_model.n_estimators)
[Out]: 100
This indicates that the model has been built using 100 trees, which is the
default number of trees in Scikit-Learn. This is a hyperparameter that can
be changed as we will see in Chapter 16 where we explore the topic of
hyperparameter tuning in greater detail.
We can also print the feature importances of the model, just as we did in the
previous chapter for the decision tree algorithm:
[Out]:
area: 0.4571
bathrooms: 0.1592
stories: 0.0580
parking: 0.0542
bedrooms: 0.0463
furnishingstatus_unfurnished: 0.0350
airconditioning_yes: 0.0282
airconditioning_no: 0.0281
basement_no: 0.0188
basement_yes: 0.0167
prefarea_yes: 0.0145
prefarea_no: 0.0140
furnishingstatus_furnished: 0.0110
hotwaterheating_no: 0.0110
furnishingstatus_semi-furnished: 0.0109
hotwaterheating_yes: 0.0099
guestroom_no: 0.0094
mainroad_no: 0.0055
mainroad_yes: 0.0049
Similar to our conclusion in the previous chapter regarding the decision tree
regression model, we find that the area of the house and the number of
bathrooms remain the primary drivers for the model’s predictions.
Conversely, from the preceding output, we observe that residing on the
main road (yes or no) has the least impact on the model’s predictions.
[In]: tree_depths = [estimator.get_depth() for estimator in random_forest_model.estimators_]
[In]: print("Depth of each tree:", tree_depths)
[Out]:
Depth of each tree: [16, 18, 17, 19, 16, 18, 15, 15, 18, 15, 18, 17, 17, 18, 18,
15, 19, 17, 17, 16, 19, 19, 18, 17, 16, 15, 17, 15, 19, 18, 16, 19, 17, 16, 14,
14, 17, 17, 15, 18, 18, 16, 19, 14, 17, 16, 18, 18, 19, 17, 15, 19, 18, 18, 17,
16, 17, 18, 16, 17, 15, 17, 16, 19, 20, 17, 15, 18, 17, 19, 16, 15, 17, 17, 17,
15, 16, 17, 16, 14, 20, 17, 17, 16, 14, 17, 19, 15, 14, 17, 18, 16, 17, 18, 17,
20, 19, 18, 16, 17]
The output represents the maximum depth of each of the 100 individual
decision trees in the random forest model. The depth of a decision tree
refers to the number of levels or layers of splits it has in its structure. Each
level represents a decision point based on a feature, and the tree’s leaves
hold the final predictions.
There are seven distinct values (14, 15, 16, 17, 18, 19, 20) in the preceding
output, meaning that multiple trees share the same value. In Scikit-Learn, the maximum depth of the individual decision trees defaults to None, which means each tree is expanded until its leaves are pure or contain fewer than the minimum number of samples required to split a node. This contrasts with
PySpark, as we will see in the next subsection, as the default maximum
depth of a tree is set to 5.
Let’s take a look at a few values from the list and understand their meaning.
For example, the first four values, [16, 18, 17, 19], indicate that the first
decision tree in the ensemble has a maximum depth of 16. This means that
the tree has been split into a series of decisions up to 16 levels deep. It can
potentially create a tree structure with 2^16
(65,536) leaf nodes. Similarly, the second decision tree has a maximum
depth of 18, and so on for the remaining trees in the ensemble.
In general, a deeper decision tree can model the training data more
accurately, as it can capture finer details and interactions between features.
The ensemble nature of a random forest helps mitigate overfitting by
aggregating the predictions of multiple decision trees.
(Note, however, that excessively deep trees can also lead to overfitting, where the model fits the training data too closely and doesn't generalize well to new, unseen data.)
Step 4: Making predictions on the testing set
In this step, the trained random forest model is used to predict the target
variable for the testing feature set (X_test). The predicted values are stored
in the y_pred array.
We can compare a sample of the actual and predicted target values as
follows:
[In]: results_df = pd.DataFrame({'Price': y_test.values, 'Prediction': y_pred})
[In]: print(results_df.head())
[Out]:
     Price  Prediction
   4060000     5499270
   6650000     7122220
   3710000     3826900
   6440000     4461100
   2800000     3644757
If the model were 100% accurate, the Price column (actual price of a house)
and the Prediction column (predicted price) would be exactly the same.
However, there are differences between the two columns. In the first row,
for example, the actual price is $4,060,000, while the predicted price is
$5,499,270. This indicates that the model overestimates the price by
$1,439,270 or 35% of the actual price.
We can print the values of the RMSE and r2 metrics with the following code:
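Sketched in the same way as in the previous chapter:
[In]: r2 = r2_score(y_test, y_pred)
[In]: rmse = np.sqrt(mean_squared_error(y_test, y_pred))
[In]: print(f"RMSE: {rmse:,.0f}, r2: {r2:.2f}")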
With the evaluation metrics generated from the random forest regression
model, we can now compare their values with those we obtained in the
previous chapter for the decision tree regression model. This is to test the
hypothesis that random forest regression models perform better than
decision trees. Here's the comparison:

Model          RMSE       r2
random forest  1,388,213  0.62
decision tree  1,715,691  0.42
The preceding output indicates that the random forest regression model
indeed outperforms the decision tree regression model. This is evident in its
substantially lower RMSE (1,388,213 vs. 1,715,691) and notably higher r2
(0.62 vs. 0.42). In the context of the random forest model, this signifies that
approximately 62% of the variations in house prices can be attributed to
variations in the features of the house. In contrast, the decision tree
regression model explains around 42% of the variations.
It remains true, however, that even the random forest model’s r2 of 62%
isn’t considered high. This suggests that even though this model has
significantly mitigated the overfitting of the decision tree model, in a real-
world scenario, more steps will be taken to improve the model, such as
reducing its complexity through feature extraction or feature selection. As
we will see in Chapter 16, fine-tuning a model’s hyperparameters to find the
most optimal combination of values is another way to improve its
performance.
PySpark
In the following code, we first create a Pandas DataFrame and then convert
it to PySpark using the createDataFrame() method.
[In]: import pandas as pd
[In]: from pyspark.sql import SparkSession
[In]: spark = SparkSession.builder.appName("RandomForestRegressorExample").getOrCreate()
[In]: def load_housing_data():
          url = ('https://raw.githubusercontent.com/abdelaziztestas/'
                 'spark_book/main/housing.csv')
          return pd.read_csv(url)
[In]: pandas_df = load_housing_data()
[In]: spark_df = spark.createDataFrame(pandas_df)
In this code, the first line imports the Pandas library and names it pd. The
second line imports the SparkSession class from the PySpark library, which
is the entry point to utilizing Spark functionalities. The third line establishes
a Spark Session named
We will now write PySpark code to prepare the housing data, build and
train the random forest regression model using the training subsample, and
then evaluate it using the test set. As a first step, we will need to import the
necessary libraries: 129
VectorAssembler
spark_df = indexer.fit(spark_df).transform(spark_df)
outputCol='features')
seed=42)
predictionCol='prediction', metricName='rmse')
predictionCol='prediction', metricName='r2')
[In]: r2 = evaluator_r2.evaluate(predictions)
The preceding steps demonstrate how we can efficiently prepare data, build
the model, train it, and evaluate its performance using just a few lines of
code. With these steps outlined, let’s now provide a detailed explanation for
each.
In this step, the code imports libraries that enable various stages of model construction: StringIndexer, OneHotEncoder, and VectorAssembler for data preparation, RandomForestRegressor for building the regression model, and RegressionEvaluator for assessing its performance.
+----------------+----------------------+------------------------+
|furnishingstatus|furnishingstatus_label|furnishingstatus_encoded|
+----------------+----------------------+------------------------+
+----------------+----------------------+------------------------+
(20,[0,1,2,3,4,5,7,9,11,14,16,19],
[7420.0,4.0,2.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
There are 20 features, as indicated by the number before the square brackets. This is the sum of the five original numerical features (area, bedrooms, bathrooms, stories, and parking) and the 15 slots produced by one-hot encoding the seven categorical columns: two slots for each of the six binary columns (mainroad, guestroom, basement, hot water heating, air conditioning, and preferred area) and three slots for the furnishing status column with its three unique values (furnished, semi-furnished, and unfurnished).
There are 12 indices and 12 corresponding values in the first and second brackets, respectively. These are the nonzero entries of the sparse vector: the five numerical values plus one active slot for each of the seven one-hot encoded categorical columns.
The dataset is divided into training and testing sets for model development
and evaluation. Using the randomSplit() method, the data is partitioned into
an 80% training set and a 20% test set. The split is carried out with a
specific seed value (42) for reproducibility.
We can check the number of decision trees used in the model and the depth
of each tree with the following code:
[In]: from collections import Counter
[In]: tree_depths = []
[In]: for tree in spark_model.trees:
          depth = tree.depth
          tree_depths.append(depth)
[In]: print(Counter(tree_depths))
After importing the Counter class from the collections module, the code
initializes an empty list to store tree depths. Next, it iterates through each
tree and retrieves its depth. The code then counts the occurrences of each
tree depth and finally prints the depths and their counts.
The output indicates that there are 20 trees, each with a depth of 5. These
are the default values in PySpark, which differ from those of the Scikit-
Learn algorithm: Scikit-Learn uses 100 trees by default and does not have a
specific default value for the depth of a tree. These variations can lead to
differences in the model’s performance.
[Out]:
area: 0.3693
bathrooms: 0.2030
stories: 0.0604
bedrooms: 0.0545
parking: 0.0446
basement_encoded: 0.0094
prefarea_encoded: 0.0087
guestroom_encoded: 0.0080
mainroad_encoded: 0.0072
airconditioning_encoded: 0.0072
hotwaterheating_encoded: 0.0045
The code begins by directly printing the feature importances obtained from
the random forest model. It then retrieves the column names from the
feature_cols list.
The output indicates that the area of the house and the number of bathrooms
are the main drivers behind the random forest predictions. Air conditioning
and hot water heating are the least important features in the model.
Step 4: Prediction
With the trained model, predictions are made on the test data. The transform
method is applied to the test dataset using the trained model, yielding
predicted values for the target variable.
[In]: predictions.select('price', 'prediction').show(5)
[Out]:
+-------+----------+
|  price|prediction|
+-------+----------+
|6930000|   5198444|
|7070000|   4911813|
|7210000|   7193449|
|7350000|   5423289|
|7455000|   3970584|
+-------+----------+
RMSE measures the average prediction error, while R-squared assesses the
model’s goodness of fit to the data. The evaluate method calculates these
metrics based on the predicted and actual target values, providing insights
into the model’s accuracy and predictive capabilities.
We can print the values of the two metrics with the following code:
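For example (a sketch, assuming rmse and r2 hold the evaluator results):
[In]: print(f"RMSE: {rmse:,.0f}")
[In]: print(f"r2: {r2:.2f}")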
Comparing these values with those obtained for the PySpark decision tree in the previous chapter gives the following:

Model          RMSE       r2
random forest  1,237,365  0.50
decision tree  1,463,512  0.32
Bringing It All Together
Scikit-Learn
The following is the code to prepare data, build and train the model, and
evaluate its performance:
url = ('https://raw.githubusercontent.com/abdelaziztestas/'
'spark_book/main/housing.csv')
return pd.read_csv(url)
[In]: y = pandas_df['price']
feature_names_out(cat_cols))
size=0.2, random_state=42)
# Train the Random Forest model
estimators)
forest_model.estimators_]
[In]: print(results_df.head())
PySpark
Provided here is the code to prepare the data, build and train the random
forest in PySpark, and evaluate its performance:
VectorAssembler
[In]: spark =
SparkSession.builder.appName("RandomForestRegressorExample").
getOrCreate()
url = ('https://raw.githubusercontent.com/abdelaziztestas/'
'spark_book/main/housing.csv')
return pd.read_csv(url)
spark_df = indexer.fit(spark_df).transform(spark_df)
outputCol='features')
seed=42)
[In]: tree_depths = []
depth = tree.depth
tree_depths.append(depth)
importances))
predictionCol='prediction', metricName='rmse')
[In]: r2 = evaluator_r2.evaluate(predictions)
Summary
The chapter also showed that the steps of data preparation and model
development were similar between Scikit-Learn and PySpark. During this
process, parallels between the two frameworks were drawn. Subsequently,
the hypothesis that random forests outperformed decision trees was
assessed by applying the random forest model to the same housing dataset.
The hypothesis was accepted as random forests showed higher R2
and lower RMSE values. The chapter also demonstrated that Pandas and
PySpark shared similar approaches to reading and exploring data.
CHAPTER 6
Gradient-Boosted Tree Regression with Pandas, Scikit-Learn, and PySpark
On the other hand, even though both GBTs and random forests are tree-
based ensemble algorithms, they exhibit distinct training processes. More
specifically, GBTs progress by training one tree at a time, whereas random
forests have the capability to train multiple trees in parallel.
Additionally, it’s worth noting that while both random forests and GBTs
mitigate overfitting in contrast to decision trees, random forests tend to
exhibit a lesser susceptibility to overfitting than GBTs. This means that
increasing the number of trees in a random forest diminishes the risk of
overfitting, whereas increasing the number of trees in GBTs increases the
likelihood of overfitting.
The Dataset
For this chapter’s project, we will be using the same housing price dataset
we used in the previous two chapters for decision tree and random forest
modeling. The rationale behind this is that using the same dataset would
make it easier for the reader to follow and compare the results between the
last three chapters.
Since we have explored this dataset in great detail in the previous chapters,
our analysis of the data in this chapter will be brief. The main purpose is
only to remind ourselves of the main attributes of the dataset and how
Pandas and PySpark can be used interchangeably for exploratory data
analysis with PySpark having the advantage of handling much larger
volumes due to the nature of its distributed computing.
[In]: def load_housing_data():
          url = ('https://raw.githubusercontent.com/abdelaziztestas/'
                 'spark_book/main/housing.csv')
          return pd.read_csv(url)
[In]: pandas_df = load_housing_data()
Recall from the previous chapters that this dataset had 545 rows and 13
columns (12
features and 1 target variable). We can confirm this by using the Pandas
shape attribute:
[In]: print(pandas_df.shape)
We can list the name of the columns or labels using the columns attribute
for both Pandas and PySpark:
Pandas:
[In]: print(pandas_df.columns)
[Out]: Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'parking',
       'prefarea', 'furnishingstatus'],
      dtype='object')
PySpark:
[In]: print(spark_df.columns)
[In]: print(pandas_df.dtypes)
[Out]:
price int64
area int64
bedrooms int64
bathrooms int64
stories int64
mainroad object
guestroom object
basement object
hotwaterheating object
airconditioning object
parking int64
prefarea object
furnishingstatus object
dtype: object
PySpark:
[In]: for col_name, col_type in spark_df.dtypes:
          print(col_name, col_type)
[Out]:
price bigint
area bigint
bedrooms bigint
bathrooms bigint
stories bigint
mainroad string
guestroom string
basement string
hotwaterheating string
airconditioning string
parking bigint
prefarea string
furnishingstatus string
There are, therefore, seven categorical variables that will need to be one-hot
encoded (mainroad, guestroom, basement, hotwaterheating, airconditioning,
prefarea, furnishingstatus).
Six of these categorical variables have two categories each (yes or no, or
vice versa), while the furnishing status column has three categories:
furnished, semi-furnished, and unfurnished. We can confirm this with the
following code:
Pandas:
[In]: for col in ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
                  'airconditioning', 'prefarea', 'furnishingstatus']:
          unique_values = pandas_df[col].unique()
          print(f"Unique values in {col}: {unique_values}")
[Out]:
Unique values in furnishingstatus: ['furnished' 'semi-furnished' 'unfurnished']
PySpark:
[In]: from pyspark.sql.functions import col
[In]: for col_name in ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
                       'airconditioning', 'prefarea', 'furnishingstatus']:
          unique_values = spark_df.select(col(col_name)).distinct()\
              .rdd.flatMap(lambda x: x).collect()
          print(f"Unique values in {col_name}: {unique_values}")
[Out]:
Unique values in furnishingstatus: ['unfurnished', 'semi-furnished', 'furnished']
Now that we have reminded ourselves of the main attributes of the housing
dataset, we can proceed to build the gradient-boosted tree (GBT) regression
model.
The weights assigned to each tree depend on its performance in the training
set and the number of trees in the model. The term “gradient” in GBT refers
to the fact that the algorithm uses gradient descent optimization to minimize
the loss function of the model, which is typically the mean squared error for
regression problems.
Scikit-Learn/PySpark Similarities
Scikit-Learn and PySpark share the same modeling steps when conducting
gradient-boosted tree regression. As we’ll find out here, they adhere to
comparable processes for building, training, and evaluating the gradient-
boosted tree regressor:
1. Data preparation: In both frameworks, the categorical variables are one-hot encoded, and utilities such as train_test_split and randomSplit are used to partition data into training and testing sets.
2. Model training: Each framework creates a gradient-boosted tree regression model and employs the fit method. This involves providing the training data that holds the features and the target variable.
3. Model evaluation: Both frameworks compute metrics such as RMSE and r2 from the predicted and actual values.
4. Prediction: The trained model is applied to the test set to generate predictions.
[In]: def load_housing_data():
          url = ('https://raw.githubusercontent.com/abdelaziztestas/'
                 'spark_book/main/housing.csv')
          return pd.read_csv(url)
[In]: pandas_df = load_housing_data()
[In]: X = pandas_df.drop('price', axis=1)
[In]: y = pandas_df['price']
A URL pointing to the CSV file location is passed to the Pandas read_csv() method. This reads the CSV file and returns a Pandas DataFrame. The pandas_df DataFrame is created once we call the load_housing_data() function.
The code then creates two new DataFrames: X containing the 12 features
and y holding the target (price) variable. Separating X and y allows the
gradient-boosted tree algorithm to predict the y target values based on the X
input features.
Having generated the X and y DataFrames, we can now start the modeling
process in Scikit-Learn by first importing the necessary libraries.
feature_names_out(cat_cols))
test_size=0.2, random_state=42)
Step 4: Prediction
One key observation from the preceding steps is that in less than 20 code
lines, we were able to prepare the data, create and train the gradient-boosted
tree model, make predictions, and evaluate the model’s performance.
In this first step of the modeling process, essential libraries are imported to
enable various aspects of building, training, and evaluating the GBT
regression model. The libraries include Pandas for data manipulation,
Scikit-Learn’s GradientBoostingRegressor class for creating the regression
model, the train_
test_split function for splitting data into training and testing sets, r2_score
and mean_squared_error for model evaluation, and OneHotEncoder for
encoding categorical variables.
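Based on that description, the import block might look like this sketch:
[In]: import pandas as pd
[In]: from sklearn.ensemble import GradientBoostingRegressor
[In]: from sklearn.model_selection import train_test_split
[In]: from sklearn.metrics import r2_score, mean_squared_error
[In]: from sklearn.preprocessing import OneHotEncoder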
The data is divided into training and testing sets using the train_test_split
function. This division is important for training the model on one subset of
the data and evaluating its performance on another independent subset. The
function takes the feature matrix X, the target y, the test_size proportion (0.2), and a random_state value (42) to ensure reproducibility.
Step 4: Prediction
In this step, the trained GBT regression model is employed to predict the
target variable (y_pred) for the test data (X_test) using the predict() method.
This step enables us to assess how well the model’s predictions align with
the actual values.
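As a sketch (gbt_regressor is the name used for the fitted model later in this section):
[In]: y_pred = gbt_regressor.predict(X_test)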
The code in this step calculates two key evaluation metrics: R-squared (r2)
and root mean squared error (RMSE). R-squared measures the proportion of
the variance in the dependent variable (price in this case) that’s predictable
from the independent variables. RMSE quantifies the average difference
between predicted and actual values, providing insight into the model’s
prediction accuracy and overall performance.
Before we print the RMSE and r2 of the model, it’s worth going through a
few points of consideration. First, since we didn’t specify the number of
estimators (also known as boosting stages), the model will use the default
number of trees, which is 100 in Scikit-Learn. We can confirm this with the
following code:
[In]: print(gbt_regressor.n_estimators_)
[Out]: 100
The model also uses the default maximum depth of the individual trees
(base learners), which is set to 3. The following code and output confirm
this:
[In]: print(gbt_regressor.max_depth)
[Out]: 3
The depth of a decision tree refers to the number of levels or layers of splits
it has in its structure. Each level represents a decision point based on a
feature, and the tree’s leaves hold the final predictions. In our example, each
of the 100 trees will have a maximum depth of 3, meaning that each tree
has been split into a series of decisions up to 3 levels deep. It can
potentially create a tree structure with 2^3 (8) leaf nodes.
Like decision trees and random forests in the preceding chapters, the
gradient-boosted tree algorithm allows us to extract the list of feature
importances. We can do this with the following code:
[Out]:
area : 0.4607
bathrooms : 0.1673
airconditioning_no : 0.0556
parking : 0.0511
stories : 0.0455
bedrooms : 0.0441
airconditioning_yes : 0.0350
furnishingstatus_unfurnished : 0.0297
basement_no : 0.0222
prefarea_yes : 0.0199
hotwaterheating_no : 0.0134
mainroad_no : 0.0129
basement_yes : 0.0106
guestroom_no : 0.0089
prefarea_no : 0.0069
guestroom_yes : 0.0048
mainroad_yes : 0.0037
furnishingstatus_furnished : 0.0030
furnishingstatus_semi-furnished : 0.0023
hotwaterheating_yes : 0.0023
The trained gradient-boosted tree model is used to predict the target variable for the testing feature set (X_test). The predicted values are stored in the y_pred array.
[Out]:
      Price  Prediction
  4,060,000   4,502,828
  6,650,000   7,301,498
  3,710,000   3,697,702
  6,440,000   4,415,743
  2,800,000   3,722,737
We can tell from the output that the accuracy of the model isn’t 100%
because there are discrepancies between the actual and predicted prices. In
the first row, for example, the actual price is $4,060,000, while the
predicted price is $4,502,828, indicating that the model overestimates the
price. The model also overestimates the price in the second row (7,301,498
vs. 6,650,000), as well as the fifth row (3,722,737 vs. 2,800,000). In the
third and fourth rows, however, the model underestimates the price.
Now that we have obtained the values of the evaluation metrics for the
gradient-boosted tree regression model, we can compare them with the
values from the preceding two chapters for the random forest and decision
tree regression models.
Model                  RMSE       r2
Gradient-boosted tree  1,298,307  0.67
random forest          1,388,213  0.62
decision tree          1,715,691  0.42
In this table, the first column represents the type of algorithm used. The
second column contains the values of the RMSE, which measures the
average difference between predicted and actual values. A lower RMSE
indicates better model performance. The last column contains r2 (R-
squared), which is a measure of how well the model’s predictions match the
variability of the actual data. It ranges from 0 to 1, with higher values
indicating a better fit.
The differences in performance could also mean that the particular dataset
we are working on has patterns and relationships that GBT is particularly
good at capturing.
GBT has a sequential nature that aims to correct the errors of previous trees,
which can be advantageous when dealing with complex relationships in
data.
We will now proceed to build the same GBT regression model and train it
using the same housing dataset using PySpark. First, let’s have a quick
reminder of the code that we used to set up PySpark in the “The Dataset”
section.
[In]: import pandas as pd
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("GBTRegression").getOrCreate()

      def load_housing_csv():  # loader function; name assumed
          url = ('https://raw.githubusercontent.com/abdelaziztestas/'
                 'spark_book/main/housing.csv')
          return pd.read_csv(url)
In this code, several steps are taken to work with the housing data using
both the Pandas library and the PySpark framework. The first line imports
the Pandas library and names it pd. The second line imports the
SparkSession, which is the entry point to utilizing Spark functionalities.
The code then establishes a Spark Session using the
SparkSession.builder.appName() method. This session serves as the
connection to the Spark cluster and provides the environment to perform
distributed data processing.
We will now write the PySpark code for the steps to prepare data, build and
train the model, and evaluate its performance using the test set. First, let’s
import the necessary libraries:
[In]: import pandas as pd
      from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
      from pyspark.ml.regression import GBTRegressor
      from pyspark.ml.evaluation import RegressionEvaluator
In this first step of the modeling process, libraries for working with PySpark
are imported. These include StringIndexer, OneHotEncoder, and
VectorAssembler for data preparation, GBTRegressor for performing
regression analysis, and RegressionEvaluator for evaluating the model’s
performance. The Pandas library is also imported and named pd.
In this substep of the data preparation step, the categorical columns in the
dataset are identified. Each categorical column is then label-encoded using
StringIndexer, generating a corresponding column with _label suffix in the
DataFrame.
In the first line, a list named cat_columns is created, containing the names
of categorical columns in the dataset that require encoding. These
categorical columns represent binary attributes such as mainroad,
guestroom, basement, and others. The code then proceeds to create a list
named indexers, which will store instances of the StringIndexer
transformation for each categorical column. The transformation is
initialized by specifying the input column (e.g., mainroad) and the desired
output column (e.g., mainroad_label). In the next substep, these label
columns are one-hot encoded into columns carrying the _encoded suffix.
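The full listing is not preserved here; a minimal sketch of this substep, with the column list inferred from the feature importances shown later in the chapter, looks as follows:

cat_columns = ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
               'airconditioning', 'prefarea', 'furnishingstatus']
indexers = [StringIndexer(inputCol=c, outputCol=c + '_label')
            for c in cat_columns]
for indexer in indexers:
    spark_df = indexer.fit(spark_df).transform(spark_df)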
The code starts by creating an instance of the OneHotEncoder by specifying
two lists: inputCols and outputCols. The inputCols list is generated using a
list comprehension that iterates through the categorical columns stored in
cat_columns and appends _label to each column name. This aligns with the
previously created categorical label columns from the StringIndexer step.
Similarly, the outputCols list is generated by appending _encoded to each
corresponding categorical column name.
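A sketch of this substep, assuming the cat_columns list and the _label columns created above:

encoder = OneHotEncoder(inputCols=[c + '_label' for c in cat_columns],
                        outputCols=[c + '_encoded' for c in cat_columns])
spark_df = encoder.fit(spark_df).transform(spark_df)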
The first line of code defines a list named feature_cols, containing the
names of the features that will be used for modeling. These features include
numerical attributes like area, bedrooms, bathrooms, stories, and parking.
Additionally, the list comprehension adds the names of the encoded
categorical columns with _encoded appended to each column name. These
encoded columns were generated in the previous step using the
OneHotEncoder technique.
A VectorAssembler is then created with feature_cols as its input columns
and applied to spark_df using the transform() method. This step generates a
new column named features in the DataFrame. Each row in this column
contains a vector that encapsulates the values of all selected features for
that particular row.
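A sketch of the feature assembly substep just described:

feature_cols = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking'] + \
               [c + '_encoded' for c in cat_columns]
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
spark_df = assembler.transform(spark_df)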
The dataset is split into training and testing sets with an 80-20 ratio. This
split is essential to evaluate the model’s performance on unseen data.
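A sketch of the split, with seed=42 taken from the code fragment shown earlier in the chapter:

train_data, test_data = spark_df.randomSplit([0.8, 0.2], seed=42)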
The GBTRegressor is then fit on the training data to train the
regression model. During training, the model learns to predict the target
variable (price) based on the provided feature vectors (features) in the
training dataset. The result of this operation is the creation of a trained
model, stored in the spark_model variable.
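A sketch of the training step; the variable name spark_model follows the text:

gbt = GBTRegressor(labelCol='price', featuresCol='features', seed=42)
spark_model = gbt.fit(train_data)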
With this trained model, we can now access the number of trees and
maximum depth from the GBT model:
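A sketch of how these values can be read from the trained model, assuming the Spark 3.x parameter accessors:

print(spark_model.getNumTrees)    # number of trees in the ensemble (20 by default)
print(spark_model.getMaxDepth())  # maximum depth of each tree (5 by default)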
The output shows that the model uses 20 trees, each with a maximum depth of 5. The
depth of a tree signifies how many levels of splits it can have, allowing it to
learn intricate patterns in the data. A depth of 5 indicates that each of the 20
trees of the GBT model can have up to 5 levels of splits, potentially
capturing relatively detailed features of the data. However, a deeper tree
may also lead to overfitting, especially if the dataset is not large enough to
support such complexity.
We can also access the feature importances from the trained GBT model,
pair them with the feature names, sort them in descending order, and print
each value with a loop:
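A simplified sketch of this step, assuming spark_model and feature_cols from above; it pairs the assembler's input column names with the importance scores, which lines up only when each encoded column occupies a single slot in the assembled vector:

print(spark_model.featureImportances)
importances = sorted(zip(feature_cols, spark_model.featureImportances.toArray()),
                     key=lambda pair: pair[1], reverse=True)
print("Sorted Feature Importances:")
for name, importance in importances:
    print(f"{name}: {importance:.4f}")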
[Out]:
Sorted Feature Importances:
area: 0.3818
bathrooms: 0.1089
stories: 0.1085
parking: 0.0900
bedrooms: 0.0808
airconditioning_encoded: 0.0343
mainroad_encoded: 0.0240
basement_encoded: 0.0227
furnishingstatus_encoded: 0.0081
guestroom_encoded: 0.0000
hotwaterheating_encoded: 0.0000
prefarea_encoded: 0.0000
The output confirms what we already know from the Scikit-Learn GBT
regression model of the previous subsection: that the area of a property and
the number of bathrooms it has are the key drivers in the model. The output
also shows that whether a house has a guest room, hot water heating
system, or is located in a preferred area has no importance at all. This
suggests that removing these features from the model may boost
performance by reducing the model’s complexity and its potential for
overfitting.
The trained model is employed to make predictions on the test dataset. The
transform() method is utilized, which applies the model to the test data and
generates predictions for the target variable (price).
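A sketch of this step:

predictions = spark_model.transform(test_data)
predictions.select('price', 'prediction').show(5)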
[Out]:
+---------+-----------+
| Price | Prediction|
+---------+-----------+
| 6930000 | 5365539 |
| 7070000 | 5303199 |
| 7210000 | 7355716 |
| 7350000 | 5057867 |
| 7455000 | 2507188 |
+---------+-----------+
This output suggests that the model both overestimates and underestimates
the property prices. It tends to underestimate the prices more frequently
than overestimating them.
For the first property with an actual price of $6,930,000, the model predicts
a value of $5,365,539. This prediction is less than the actual price,
indicating an underestimation by the model. Similarly, the second property
with an actual price of $7,070,000 receives a prediction of $5,303,199,
which is lower than the actual price, also suggesting underestimation. The
third property with an actual price of $7,210,000 gets a prediction of
$7,355,716, which is slightly higher than the actual price, indicating
overestimation.
For the fourth property with an actual price of $7,350,000, the model
predicts $5,057,867, reflecting underestimation. Finally, the fifth property
with an actual price of $7,455,000 receives a prediction of $2,507,188,
which is far below the actual price, again indicating underestimation.
This step focuses on evaluating the model’s performance using RMSE and
R2. In the first line, an instance of the RegressionEvaluator class is created
and assigned to the variable evaluator_rmse. The labelCol parameter is set
to price, the predictionCol parameter is set to prediction, and the
metricName is set to rmse. A second evaluator, evaluator_r2, is created in
the same way with metricName set to r2.
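A sketch of the two evaluators, with the parameter values mirroring the fragments of the original listing:

evaluator_rmse = RegressionEvaluator(labelCol='price',
                                     predictionCol='prediction',
                                     metricName='rmse')
evaluator_r2 = RegressionEvaluator(labelCol='price',
                                   predictionCol='prediction',
                                   metricName='r2')
rmse = evaluator_rmse.evaluate(predictions)
r2 = evaluator_r2.evaluate(predictions)
print(f'RMSE: {rmse:,.0f}')
print(f'r2: {r2:.2f}')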
With these results, let’s compare the performance of the GBT regression
model in this chapter with the other tree-based regression algorithms
(decision trees and random forests) in the preceding chapters. Based on the
evaluation metrics shown in the following table, the random forest model
appears to be better compared to the GBT
and decision tree models. This conclusion is drawn from the lower RMSE
value and the higher R-squared (r2) value shown by the random forest
model, as elaborated here:
• Lower RMSE: The random forest model has the lowest root mean
squared error, which measures the magnitude of prediction errors. The
fact that the random forest model has the lowest RMSE suggests that it
is making predictions that are, on average, closer to the actual property
prices.
• Higher R-squared: R-squared measures how much of the variance in the
target variable a model explains. In this case, the random forest model’s
ability to explain a larger portion of the variance in property prices
suggests that it captures underlying patterns in the data more effectively.
On the other hand, even though the random forest model stands out as the
best-performing model among the three evaluated, with a lower RMSE and
a relatively higher R-squared (r2) value of 0.50, there is still room for
improvement in terms of its predictive capability. While the R-squared
value of 0.50 indicates that the model explains approximately 50% of the
variance in property prices, there remains potential to enhance its ability to
capture more nuances and patterns within the data. This suggests that
further optimization of the model’s hyperparameters and potentially
removing the features with the least importances could lead to even more
accurate predictions and a higher r2 value.
Model                    RMSE         r2
Gradient-boosted tree    1,335,830    0.42
Random forest            1,237,365    0.50
Decision tree            1,463,512    0.32
Scikit-Learn
To conclude the chapter, we combine all the Scikit-Learn snippets into a
single workflow: load the housing data from the GitHub URL, one-hot
encode the categorical columns (naming the encoded features with
get_feature_names_out(cat_cols)), split the data into training and testing
sets with test_size=0.2 and random_state=42, train the GBT regression
model, generate predictions, and print the first rows of the results
DataFrame with print(results_df.head()).
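Because the original combined listing is not preserved in this copy, the following is a condensed sketch of the workflow just described; it assumes scikit-learn 1.2 or later (for the sparse_output argument), and the variable names are assumptions:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load the housing data
url = ('https://raw.githubusercontent.com/abdelaziztestas/'
       'spark_book/main/housing.csv')
pandas_df = pd.read_csv(url)

# One-hot encode the categorical columns
cat_cols = ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
            'airconditioning', 'prefarea', 'furnishingstatus']
encoder = OneHotEncoder(sparse_output=False)
encoded = pd.DataFrame(encoder.fit_transform(pandas_df[cat_cols]),
                       columns=encoder.get_feature_names_out(cat_cols))

# Assemble features and target
X = pd.concat([pandas_df[['area', 'bedrooms', 'bathrooms', 'stories',
                          'parking']], encoded], axis=1)
y = pandas_df['price']

# Split, train, and predict
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
gbt_regressor = GradientBoostingRegressor()
gbt_regressor.fit(X_train, y_train)
y_pred = gbt_regressor.predict(X_test)

# Compare actual and predicted prices, then evaluate
results_df = pd.DataFrame({'Price': y_test.values, 'Prediction': y_pred})
print(results_df.head())
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f'RMSE: {rmse:,.0f}, r2: {r2:.2f}')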
PySpark
In this subsection, we do the same with PySpark code: combine all the
snippets we have used to prepare the housing data, build and train the GBT
regression model, make predictions, and evaluate the model’s performance.
The combined code imports the necessary libraries (pandas, StringIndexer,
OneHotEncoder, VectorAssembler, GBTRegressor, and RegressionEvaluator),
creates the Spark Session, loads the housing CSV from the GitHub URL,
label-encodes and one-hot encodes the categorical columns, assembles the
feature vector, splits the data into training and testing sets with seed=42,
trains the gradient-boosted tree regression model, inspects the number of
trees, the maximum depth, and the feature importances, and finally evaluates
the predictions with the rmse and r2 evaluators, exactly as in the snippets
shown earlier in this section.
Summary
This chapter introduced gradient-boosted tree (GBT) regression and
demonstrated the construction of a regression model using the gradient
boosting algorithm. The chapter also examined the similarities between
Pandas and PySpark in terms of reading and exploring data and showed that
Scikit-Learn and PySpark follow similar modeling steps, which include
transformation, categorical encoding, data scaling, model construction,
prediction, and evaluation.
The chapter also compared the performance of the GBT regression model
with the performance of the other two tree-based algorithms (decision trees
and random forests).
The key takeaway is that even the best-performing model among the three
can still benefit from further hyperparameter optimization and
dimensionality reduction.
CHAPTER 7
Logistic Regression with Pandas,
Scikit-Learn, and PySpark
For the purpose of this project, we will use the same diabetes dataset that
we utilized in Chapter 2 to select algorithms. Before building our logistic
regression model and training it on this data, let’s first remind ourselves of
the main attributes of this dataset.
The Dataset
The dataset used for this project is the Pima Indians Diabetes Database we
used in Chapter 2. The dataset contains 768 records, each of which
represents a Pima Indian woman and includes various health-related
attributes, along with a target variable indicating whether the woman
developed diabetes or not.
Source: Kaggle
URL: www.kaggle.com/uciml/pima-indians-diabetes-database
Date: 2016
[In]: def load_diabetes_csv():
          url = ('https://raw.githubusercontent.com/abdelaziztestas/'
                 'spark_book/main/diabetes.csv')
          return pd.read_csv(url)
We then write a Python function that prints basic information about the
pandas_df:
[In]: def explore_pandas_df():
          pandas_df = load_diabetes_csv()
          columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'BMI',
                     'DiabetesPedigreeFunction', 'Age']
          selected_df = pandas_df[columns]
          selected_df = selected_df.loc[(pandas_df['Glucose'] != 0)
                                        & (pandas_df['BloodPressure'] != 0)
                                        & (pandas_df['BMI'] != 0)]
          print(f'First 5 rows:\n{selected_df.head()}\n')
          print(f'Shape: {selected_df.shape}\n')
          print("Info:")
          selected_df.info()
          print(f'Summary statistics:\n'
                f'{selected_df.describe().round(2)}\n')
The reason for this filtering was explained back in Chapter 2, where we
used the same dataset to illustrate the concept of k-fold cross validation. We
realized there were invalid readings for Glucose, Blood Pressure, Skin
Thickness, Insulin, and BMI as they had 0 values. We reasoned that it made
sense to exclude Skin Thickness and Insulin as the number of cases with 0
values was too large. We also decided to exclude rows with 0 values for
Glucose, Blood Pressure, and BMI. The number of invalid cases was not
too large, so excluding rows won’t significantly impact the sample size.
[In]: print(explore_pandas_df())
[Out]:
First 5 rows:
   Pregnancies  Glucose  BloodPressure   BMI  DiabetesPedigreeFunction  Age
0            6      148             72  33.6                     0.627   50
1            1       85             66  26.6                     0.351   31
2            8      183             64  23.3                     0.672   32
3            1       89             66  28.1                     0.167   21
4            0      137             40  43.1                     2.288   33

Shape: (724, 6)
Info:
<class 'pandas.core.frame.DataFrame'>
Summary statistics:
       Pregnancies  Glucose  BloodPressure     BMI  DiabetesPedigreeFunction     Age
count       724.00   724.00         724.00  724.00                    724.00  724.00
mean          3.87   121.88          72.40   32.47                      0.47   33.35
std           3.36    30.75          12.38    6.89                      0.33   11.77
min           0.00    44.00          24.00   18.20                      0.08   21.00
25%                   99.75          64.00   27.50                      0.24   24.00
50%                  117.00          72.00   32.40                      0.38   29.00
75%                  142.00          80.00   36.60                      0.63   41.00
max          17.00   199.00         122.00   67.10                      2.42   81.00
We get a set of four outputs. The first section of the output shows the first
five rows of the DataFrame. Each row represents a set of values for the
columns Pregnancies, Glucose, Blood Pressure, BMI, Diabetes Pedigree
Function, and Age. The Shape section indicates the shape of the
DataFrame, showing the number of rows and columns. In this case, the
DataFrame has 724 rows and 6 columns. The Info section provides
information about the DataFrame, including the data types of each column
and the count of non-null values. Finally, the Summary statistics section
presents summary statistics for each column in the DataFrame. It provides
key statistical measures such as the count, mean, standard deviation,
minimum, 25th percentile, median (50th percentile), 75th percentile, and
maximum value for each numerical column.
[In]: from pyspark.sql import SparkSession
      from pyspark.sql.functions import col

      def explore_spark_df():
          pandas_df = load_diabetes_csv()
          spark = SparkSession.builder.getOrCreate()
          spark_df = spark.createDataFrame(pandas_df)
          # keep the six predictor columns used in the Pandas exploration
          spark_df = spark_df.select('Pregnancies', 'Glucose', 'BloodPressure',
                                     'BMI', 'DiabetesPedigreeFunction', 'Age')
          spark_df = spark_df.filter((col('Glucose') != 0)
                                     & (col('BloodPressure') != 0)
                                     & (col('BMI') != 0))
          spark_df.show(5)
          print(f'Shape: ({spark_df.count()}, {len(spark_df.columns)})')
          spark_df.printSchema()
          spark_df.summary().show()
The code first imports the necessary classes (SparkSession and col) and
defines the explore_spark_df() function, which loads the diabetes CSV file
into pandas_df DataFrame using the load_diabetes_csv() function. It then
creates a SparkSession object named spark using the
SparkSession.builder.getOrCreate() method and converts the pandas_df to
spark_df using the createDataFrame() method.
Finally, the function displays a summary that includes statistics for the
selected columns using the summary() method.
[In]: print(explore_spark_df())
[Out]:
+-----------+-------+-------------+----+------------------------+---+
|Pregnancies|Glucose|BloodPressure| BMI|DiabetesPedigreeFunction|Age|
+-----------+-------+-------------+----+------------------------+---+
|          6|    148|           72|33.6|                   0.627| 50|
|          1|     85|           66|26.6|                   0.351| 31|
|          8|    183|           64|23.3|                   0.672| 32|
|          1|     89|           66|28.1|                   0.167| 21|
|          0|    137|           40|43.1|                   2.288| 33|
+-----------+-------+-------------+----+------------------------+---+

Shape: (724, 6)
root
 |-- Pregnancies: long (nullable = true)
 |-- Glucose: long (nullable = true)
 |-- BloodPressure: long (nullable = true)
 |-- BMI: double (nullable = true)
 |-- DiabetesPedigreeFunction: double (nullable = true)
 |-- Age: long (nullable = true)

+-------+-----------+-------+-------------+-----+------------------------+-----+
|summary|Pregnancies|Glucose|BloodPressure|  BMI|DiabetesPedigreeFunction|  Age|
+-------+-----------+-------+-------------+-----+------------------------+-----+
|  count|        724|    724|          724|  724|                     724|  724|
|   mean|       3.87| 121.88|        72.40|32.47|                    0.47|33.35|
| stddev|       3.36|  30.75|        12.38| 6.89|                    0.33|11.77|
|    min|          0|     44|           24| 18.2|                   0.078|   21|
|    25%|           |     99|           64| 27.5|                   0.245|   24|
|    50%|           |    117|           72| 32.4|                   0.378|   29|
|    75%|           |    142|           80| 36.6|                   0.627|   41|
|    max|         17|    199|          122| 67.1|                    2.42|   81|
+-------+-----------+-------+-------------+-----+------------------------+-----+
The target variable, which is not shown in the output, signifies whether a
Pima woman has diabetes (assigned a value of 1) or does not have diabetes
(assigned a value of 0). This variable is the focus of our prediction.
Logistic Regression
In this section, we build, train, and evaluate a logistic regression model and
use it to predict the likelihood of a Pima Indian woman having diabetes
based on a set of predictor variables. We use six features: number of
pregnancies, blood pressure, BMI, glucose level, diabetes pedigree
function, and age. These features are important factors in predicting the
presence of diabetes.
For example, for a woman who has had several pregnancies, high blood
pressure, a high BMI, high glucose levels, a family history of diabetes, and
is 50 years old, the logistic regression model might output a probability
score of 0.8, indicating that this woman has a high likelihood of having
diabetes. On the other hand, a woman who has had only one pregnancy,
normal blood pressure, low BMI, low glucose levels, no family history of
diabetes, and is 30 years old might have a probability score of 0.1,
indicating a low likelihood of having diabetes.
Before building the logistic regression model, let’s begin by examining the
parallels between Scikit-Learn and PySpark in terms of their modeling steps
and the functions and classes they utilize.
1. Data preparation: Scikit-Learn splits data with train_test_split, while
PySpark uses functions like randomSplit for splitting data into training
and testing sets; PySpark additionally requires assembling the features
into a single vector column.
2. Model training: Both libraries provide a LogisticRegression class and
train it with the fit() method.
3. Model evaluation: Scikit-Learn offers accuracy_score, precision_score,
recall_score, and f1_score; in PySpark, there are no built-in functions for
calculating the other metrics for binary classification in the same direct
way, so in this chapter they are computed from the counts of true and
false positives and negatives.
4. Prediction: Scikit-Learn generates predictions with predict(), whereas
PySpark uses the transform() method.
Both frameworks apply their own default parameters, which means that the
model probabilities and evaluation metrics may vary between the two. In
Chapter 16, we demonstrate how to fine-tune and align these parameters.
Logistic Regression with Scikit-Learn
Let’s start with Scikit-Learn. To predict the likelihood of diabetes using the
Pima Indian diabetes dataset, we adhere to the same steps outlined in the
preceding subsection: data preparation, model training, model evaluation,
and prediction. The code will be encapsulated within functions to enhance
the ease of execution and comprehension.
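The original import listing is not preserved in this copy; the section relies on imports along the following lines:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)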
"""
Parameters:
Returns:
"""
pandas_df = pd.read_csv(url)
return pandas_df
"""
Parameters:
test split.
Returns:
"""
random_state=random_state)
"""
Parameters:
Returns:
"""
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
"""
data.
Parameters:
Returns:
regression model.
"""
model = LogisticRegression()
model.fit(X_train, y_train)
return model
The evaluation function computes the accuracy, precision, recall, F1
score, and confusion matrix. The metrics are printed, and a DataFrame
containing the actual and predicted values is displayed.
[In]: def evaluate_model(model, X_test, y_test):
          """Evaluate the trained model on the test set and display the
          results."""
          y_pred = model.predict(X_test)
          accuracy = accuracy_score(y_test, y_pred)
          precision = precision_score(y_test, y_pred)
          recall = recall_score(y_test, y_pred)
          f1 = f1_score(y_test, y_pred)
          confusion = confusion_matrix(y_test, y_pred)
          print(f'Accuracy: {accuracy:.2f}')
          print(f'Precision: {precision:.2f}')
          print(f'Recall: {recall:.2f}')
          print(f'F1 Score: {f1:.2f}')
          print(f'Confusion Matrix:\n{confusion}')
          result_df = pd.DataFrame({'Actual': y_test,
                                    'Predicted': y_pred})
          print(result_df.head())
In this final step, we call the functions that were defined previously. There
are nine substeps as indicated here:
'spark_book/main/diabetes.csv')
6.4. Select only the specified columns and filter out rows with zeros in key
features
& (pandas_df['BloodPressure'] != 0)
[In]: y = pandas_df['Outcome']
Accuracy: 0.79
Precision: 0.66
Recall: 0.63
F1 Score: 0.64
Confusion Matrix:
[[88 14]
[16 27]]
Actual  Predicted
     0          0
     0          0
     1          1
     0          1
     1          1
After executing the nine substeps, we get the output shown previously.
Before explaining this output, let’s first go through the Scikit-Learn code
step by step.
This step has two substeps. The first substep defines a function named
split_data() that splits the dataset into training and testing sets. The function
takes several parameters:
• test_size: The proportion of the dataset to include in the test split, which
controls the train/test allocation.
• random_state: Sets the seed for the random number generator, making
the split reproducible.
Inside the function, the split is performed with the train_test_split()
function. This takes the feature matrix X and the target variable y, along
with the specified test_size and random_state values. It returns four
separate datasets: X_train, X_test, y_train, and y_test.
The second substep defines the scale_data() function. This standardizes the
features of the dataset, ensuring that they have similar scales. This is
particularly important in machine learning, where models like logistic
regression can be sensitive to the varying magnitudes of features. The
function takes two parameters:
The next step involves applying the scaling transformation to the training
data using the line X_train_scaled=scaler.fit_transform(X_train). Here, the
fit_transform() method of the scaler object is used to compute the scaling
parameters from the training data and simultaneously transform the training
features accordingly. The scaled training features are stored in the variable
X_train_scaled.
Next, the same scaling transformation is applied to the testing data using the
line X_test_scaled=scaler.transform(X_test). However, in this case, the
transform() method is used because the scaling parameters were already
computed using the training data. The scaled testing features are stored in
the variable X_test_scaled.
Finally, the function returns the scaled training and testing feature matrices.
In this step, a function called train_lr is defined. This function takes two
input arguments: X_train and y_train. Its purpose is to train a logistic
regression model using the provided training data and then return the
trained model.
Moving to the next step, the fit method of the LogisticRegression model is
called with the training data X_train and the corresponding labels y_train.
This method trains the logistic regression model using the provided data by
adjusting its parameters to fit the given training examples. Once the model
has been trained, it is returned from the function as the output.
Inside the function, the predict method of the classification model is called
with the test features X_test to make predictions on the test data. The
predicted labels are assigned to the variable y_pred.
Next, the following key evaluation metrics of the model are computed:
instances (both true positives and true negatives) out of the total number of
instances in the dataset. In other words, accuracy tells us how often the
model’s predictions match the actual labels. This metric is calculated by
comparing the predicted labels y_pred with the true labels y_test.
The print statements are then used to display the calculated metrics and the
confusion matrix. After this, a Pandas DataFrame is created to store the
actual test labels (y_test) and the predicted labels (y_pred), while the last
line of code prints the first five rows of the DataFrame to show a
comparison between the actual and predicted labels.
• Select only the specified columns and filter out rows with zeros in key
features.
• Split the data into training and testing sets by calling the
split_data(X, y) function.
• Scale the features in the training and testing sets by using the
scale_data() function.
• Train the model on the scaled training data using the train_lr() function.
• Evaluate the model on the scaled testing data using the evaluate_model()
function.
After running the code in these steps, we get the output from evaluating the
classification model using various metrics, including accuracy, precision,
recall, F1
score, and the confusion matrix. The following is an interpretation of the
output:
• Accuracy (0.79): Accuracy is the ratio of correctly predicted instances
(both true positives and true negatives) to the total number of instances,
so the model classified about 79% of the test samples correctly.
• Precision (0.66): Precision is the ratio of correctly predicted positive
instances (true positives) to all instances predicted as positive (true
positives + false positives). Here, about 66% of the cases the model
flagged as diabetic were actually diabetic.
• Recall (0.63): Recall, also known as sensitivity or true positive rate, is the
ratio of correctly predicted positive instances (true positives) to all actual
positive instances (true positives + false negatives). Here, the model’s recall
is about 63%, indicating that it correctly identified about 63% of the actual
positive instances.
• The comparison DataFrame at the end of the output shows the actual
and predicted labels for the top five instances. In four out of the five
instances, or 80%, the model’s predictions matched the actual labels
(0 and 1). This is slightly higher than the overall accuracy of 79%,
which takes into account all the samples, not just the first five rows.
Logistic Regression with PySpark
We now follow the same steps with PySpark. The first step creates a Spark
Session:
[In]: spark = SparkSession.builder.getOrCreate()
This step defines a function to read data from a URL and convert it into a
Spark DataFrame using pd.read_csv and spark.createDataFrame.
"""
DataFrame.
Parameters:
Returns:
data.
"""
pandas_df = pd.read_csv(url)
spark_df = spark.createDataFrame(pandas_df)
return spark_df
4.1. Create a function that splits the data into training and testing datasets
using randomSplit
"""
Parameters:
DataFrames.
"""
"""
Parameters:
DataFrame.
DataFrame.
Returns:
"""
assembler = VectorAssembler(inputCols=feature_cols,
outputCol='features')
return assembler.transform(train_data),
assembler.transform(test_data)
"""
Parameters:
DataFrame.
DataFrame.
Returns:
Spark DataFrames.
"""
scaler = StandardScaler(inputCol='features',
outputCol='scaled_features')
scaler_model = scaler.fit(train_data)
return scaler_model.transform(train_data),
scaler_model.transform(test_data)
"""
Parameters:
DataFrame.
Returns:
pyspark.ml.classification.LogisticRegressionModel: The
"""
spark.conf.set("spark.seed", "42")
lr = LogisticRegression(labelCol='Outcome',
featuresCol='scaled_features')
return lr.fit(train_data)
Step 6: Model evaluation
This step creates a function that evaluates the trained model on the testing
dataset.
"""
Parameters:
lr_model
(pyspark.ml.classification.LogisticRegressionModel): The
DataFrame.
"""
predictions = lr_model.transform(test_data)
predictions = predictions.withColumn('prediction',
col('prediction').cast('int'))
predictions_and_labels = predictions.select(['Outcome',
'prediction'])
tp = predictions_and_labels[(predictions_and_labels.Outcome == 1) & (predictions_and_labels.prediction == 1)].count()
tn = predictions_and_labels[(predictions_and_labels.Outcome == 0) & (predictions_and_labels.prediction == 0)].count()
fp = predictions_and_labels[(predictions_and_labels.Outcome == 0) & (predictions_and_labels.prediction == 1)].count()
fn = predictions_and_labels[(predictions_and_labels.Outcome == 1) & (predictions_and_labels.prediction == 0)].count()
acc = (tp + tn) / (tp + tn + fp + fn)
prec = tp / (tp + fp)
rec = tp / (tp + fn)
f1 = 2 * (prec * rec) / (prec + rec)
confusion = [[tp, fp], [fn, tn]]
print(f'Accuracy: {acc:.2f}')
print(f'Precision: {prec:.2f}')
print(f'Recall: {rec:.2f}')
print(f'F1 Score: {f1:.2f}')
print(f'Confusion Matrix:\n{confusion}')
predictions.select('Outcome', 'prediction').show(5)
This step calls the functions defined in the previous steps. It also provides
the URL
location.
'spark_book/main/diabetes.csv')
7.2. Call the read_data() function to read dataset and create Spark
DataFrame
& (col('BloodPressure') != 0)
7.3. Call the split_data() function to split data into training and testing sets
7.8. Call the evaluate_model() function to evaluate the model on the test set
[Out]:
Accuracy: 0.78
Precision: 0.62
Recall: 0.56
F1 Score: 0.59
Confusion Matrix:
[[23, 14],
[18, 91]]
+-------+----------+
|Outcome|prediction|
+-------+----------+
| 0| 0|
| 0| 1|
| 0| 0|
| 0| 0|
| 0| 0|
+-------+----------+
Before interpreting the output, let’s take a moment to go through the code
we have used to generate it. After importing the necessary libraries required
for different aspects of working with PySpark and performing the machine
learning task, we created a Spark Session to allow us to interact with the
PySpark API. We then defined a function (read_data) that reads data from
the given URL pointing to a CSV file and converts it into a Spark
DataFrame. The function takes a URL as input and returns a Spark
DataFrame.
Inside the function, the Pandas library is utilized to read the data from the
URL into a Pandas DataFrame named pandas_df. The
spark.createDataFrame(pandas_df) method is used to convert the Pandas
DataFrame into a Spark DataFrame named spark_
What follows are the steps to prepare data, train and evaluate the model, and
make predictions:
This step has three substeps. In the first substep, a function named
split_data is defined, which takes a Spark DataFrame named spark_df as an
argument. The purpose of this function is to partition the input DataFrame
into two subsets: one for training and the other for testing.
combined.
• outputCol: This is set to features, which is the name of the new vector
column that will be created.
The transform() method is then applied to the assembler instance for both
the train_data and test_data DataFrames. This transformation takes the
specified feature columns and combines them into a new vector column
named features.
In the final substep of the data preparation step, the code defines a function
named scale_features that takes two arguments:
features column
features column
The purpose of this function is to scale the features within the features
column using the StandardScaler transformation provided by PySpark’s
machine learning library.
The scaler.fit(train_data) line inside the function fits the scaler model using
the training data. This calculates the mean and standard deviation needed
for scaling the features. The transform() method is then applied to the
scaler_model for both the train_data and test_data DataFrames. This
transformation scales the features in the features column and stores the
scaled values in the scaled_features column.
The code in this step defines a function named train_model that takes one
argument:
• labelCol: This is set to Outcome, indicating the column that contains the
binary labels (0 or 1) for the logistic regression.
The logistic regression model is trained using the fit() method on the lr
instance.
The train_data DataFrame, which contains both the scaled features and the
target labels, is provided as the training dataset. The trained logistic
regression model is then returned from the function.
In this step of the modeling process, the code defines a function named
evaluate_
features
• True positives (tp): The count of instances where the actual outcome is
positive (1) and the model’s prediction is also positive (1)
• True negatives (tn): The count of instances where the actual outcome is
negative (0) and the model’s prediction is also negative (0)
• False positives (fp): The count of instances where the actual outcome is
negative (0) but the model’s prediction is positive (1)
• False negatives (fn): The count of instances where the actual outcome is
positive (1) but the model’s prediction is negative (0)
• Recall (rec): The ratio of true positives to the sum of true positives and
false negatives
The final step (step 7) pertains to calling the functions, which involves the
following eight substeps:
• Read and prepare data using read_data, filter rows, and select
columns.
• Recall (0.56): The recall score is 0.56, indicating that the model
correctly identified about 56% of the actual positive instances.
In the following two subsections, we bring all the Scikit-Learn and PySpark
code snippets together.
Scikit-Learn
"""
DataFrame.
Returns:
"""
pandas_df = pd.read_csv(url)
return pandas_df
"""
Parameters:
Returns:
"""
random_state=random_state)
"""
Parameters:
Returns:
"""
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
"""
data.
Parameters:
Returns:
regression model.
"""
model = LogisticRegression()
model.fit(X_train, y_train)
return model
"""
results.
Parameters:
"""
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred)
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'Confusion Matrix:\n{confusion}')
y_pred})
print(result_df.head())
'spark_book/main/diabetes.csv')
# Select only the specified columns and filter out rows with zeros in key
features
& (pandas_df['BloodPressure'] != 0)
[In]: y = pandas_df['Outcome']
PySpark
getOrCreate()
"""
DataFrame.
Parameters:
Returns:
data.
"""
pandas_df = pd.read_csv(url)
spark_df = spark.createDataFrame(pandas_df)
return spark_df
"""
Parameters:
Returns:
DataFrames.
"""
return spark_df.randomSplit([0.8, 0.2], seed=42)
"""
Parameters:
DataFrame.
DataFrame.
Returns:
"""
assembler = VectorAssembler(inputCols=feature_cols,
outputCol='features')
return assembler.transform(train_data),
assembler.transform(test_data)
"""
Scale the features in the dataset.
DataFrame.
DataFrame.
Returns:
Spark DataFrames.
"""
scaler = StandardScaler(inputCol='features',
outputCol='scaled_features')
scaler_model = scaler.fit(train_data)
return scaler_model.transform(train_data),
scaler_model.transform(test_data)
"""
Parameters:
train_data (pyspark.sql.DataFrame): The training Spark
DataFrame.
Returns:
pyspark.ml.classification.LogisticRegressionModel: The
"""
spark.conf.set("spark.seed", "42")
lr = LogisticRegression(labelCol='Outcome',
featuresCol='scaled_features')
return lr.fit(train_data)
"""
lr_model
(pyspark.ml.classification.LogisticRegressionModel): The
"""
predictions = lr_model.transform(test_data)
predictions = predictions.withColumn('prediction',
col('prediction').cast('int'))
predictions_and_labels = predictions.select(['Outcome',
'prediction'])
tp = predictions_and_labels[(predictions_and_labels.Outcome
tn = predictions_and_labels[(predictions_and_labels.Outcome
fp = predictions_and_labels[(predictions_and_labels.Outcome
fn = predictions_and_labels[(predictions_and_labels.Outcome
print(f'Accuracy: {acc:.2f}')
print(f'Precision: {prec:.2f}')
print(f'Recall: {rec:.2f}')
print(f'Confusion Matrix:\n{confusion}')
predictions.select('Outcome', 'prediction').show(5)
'spark_book/main/diabetes.csv')
& (col('BloodPressure') != 0)
Summary
In the next chapter, we will continue our journey with supervised learning
for classification algorithms. Specifically, we will introduce an alternative
classification model to logistic regression based on decision trees. We will
demonstrate how to prepare data, build, train, and evaluate the model, and
make predictions on new, unseen data.
Both Scikit-Learn and PySpark will be used. Pandas and PySpark will also
be compared.
CHAPTER 8
Decision Tree
Classification with
Pandas, Scikit-Learn,
and PySpark
Where decision tree classifiers excel over the logistic regression models we
built in the preceding chapter is in capturing nonlinear relationships. While
logistic regression assumes a linear relationship between features and the
log-odds of the target, decision trees can capture complex, nonlinear
decision boundaries by recursively splitting the feature space.
In this chapter, we will utilize two powerful libraries to build a decision tree
classifier: Scikit-Learn and PySpark. By comparing their Python code, we
will highlight the similarities between them, making it easier for data
scientists to switch from Scikit-Learn to PySpark and leverage the
advantages offered by PySpark’s distributed computing capabilities.
The Dataset
For the project presented in this chapter, we are utilizing the well-known
Iris dataset to construct a decision tree classifier. The dataset contains
measurements of four distinct attributes—sepal length, sepal width, petal
length, and petal width—pertaining to three different species of iris flowers:
setosa, versicolor, and virginica. The dataset, which is pre-installed with
Scikit-Learn by default, contains 150 records, corresponding to 50 samples
of each of the three species.
[In]: def create_pandas_dataframe():
          """
          Load the Iris dataset and return it as a Pandas DataFrame.
          """
          iris = load_iris()
          X = iris.data
          y = iris.target
          feature_names = iris.feature_names
          target_names = iris.target_names
          pandas_df = pd.DataFrame(X, columns=feature_names)
          pandas_df['target'] = y
          pandas_df['target_names'] = pandas_df['target'].map(
              dict(zip(range(len(target_names)), target_names)))
          return pandas_df
The first two lines of code import the necessary libraries. load_iris is a
function that allows us to load the Iris dataset, and pd is an alias for the
Pandas library. Next, the code defines the function
create_pandas_dataframe(), which doesn’t take any arguments.
Inside the function, the load_iris() function loads the Iris dataset, which
contains feature data, target labels, feature names, and target names. The
code then proceeds to extract the feature data (X) and target labels (y) from
the loaded dataset. It also extracts the feature names and target names.
We can call the function to create the pandas_df DataFrame with the
following line:
[In]: pandas_df = create_pandas_dataframe()
"""
Args:
converted.
Returns:
"""
spark =
SparkSession.builder.appName("PandasToSpark").getOrCreate()
spark_df = spark.createDataFrame(pandas_df)
return spark_df
The code first imports the SparkSession class, which is the entry point for
using Spark functionality. It then defines the convert_to_spark_df()
function. This takes one argument, pandas_df, which is the pandas
DataFrame that we want to convert to a PySpark DataFrame. Inside the
function, a Spark Session with the app name
“PandasToSpark” is created using the SparkSession.builder.appName()
method. The createDataFrame() method of the Spark Session is then used to
convert the provided Pandas DataFrame (pandas_df) to a PySpark
DataFrame (spark_df). Finally, the function returns the converted PySpark
DataFrame (spark_df).
[In]: print(pandas_df.head())
[Out]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target target_names
0                5.1               3.5                1.4               0.2       0       setosa
1                4.9               3.0                1.4               0.2       0       setosa
2                4.7               3.2                1.3               0.2       0       setosa
3                4.6               3.1                1.5               0.2       0       setosa
4                5.0               3.6                1.4               0.2       0       setosa
We can observe from the output that the label has already been converted to
a numerical format. Therefore, there will be no need for us to perform label
encoding during the modeling steps. Additionally, all features are presented
in centimeters, implying that standard scaling is unnecessary, as the features
are already on a uniform scale. It’s worth noting, however, that while tree-
based algorithms do not strictly require scaling, some practitioners do
recommend its application.
In PySpark, we can use the show(5) method to display the top five rows:
[In]: spark_df.show(5)
[Out]:
+-----------------+----------------+-----------------+----------------+------+------------+
|sepal length (cm)|sepal width (cm)|petal length (cm)|petal width (cm)|target|target_names|
+-----------------+----------------+-----------------+----------------+------+------------+
|              5.1|             3.5|              1.4|             0.2|     0|      setosa|
|              4.9|             3.0|              1.4|             0.2|     0|      setosa|
|              4.7|             3.2|              1.3|             0.2|     0|      setosa|
|              4.6|             3.1|              1.5|             0.2|     0|      setosa|
|              5.0|             3.6|              1.4|             0.2|     0|      setosa|
+-----------------+----------------+-----------------+----------------+------+------------+
We can print the shape of the pandas_df (i.e., number of rows and columns)
using the Pandas shape attribute:
[In]: print(pandas_df.shape)
[Out]: (150, 6)
We can achieve the same result with the following PySpark code:
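The listing is not preserved here; a minimal sketch of the PySpark equivalent is:

print((spark_df.count(), len(spark_df.columns)))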
[Out]: (150, 6)
Since PySpark doesn’t have the shape attribute, we combined two methods:
count() and len() to get the number of rows and columns, respectively.
[In]: pandas_df.drop(columns=['target']).describe()
[Out]:
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count             150.00            150.00             150.00            150.00
mean                5.84              3.06               3.76              1.20
std                 0.83              0.44               1.77              0.76
min                 4.30              2.00               1.00              0.10
25%                 5.10              2.80               1.60              0.30
50%                 5.80              3.00               4.35              1.30
75%                 6.40              3.30               5.10              1.80
max                 7.90              4.40               6.90              2.50
We can produce equivalent summary statistics in PySpark by selecting the
numeric columns and applying the summary() method, mirroring the Pandas
describe() code.
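A sketch of how the numeric_columns list used below can be built:

numeric_columns = [c for c in spark_df.columns
                   if c not in ('target', 'target_names')]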
[In]: spark_df.select(numeric_columns).summary().show()
[Out]:
+-------+-----------------+----------------+-----------------+----------------+
|summary|sepal length (cm)|sepal width (cm)|petal length (cm)|petal width (cm)|
+-------+-----------------+----------------+-----------------+----------------+
|  count|              150|             150|              150|             150|
|   mean|             5.84|            3.06|             3.76|            1.20|
| stddev|             0.83|            0.44|             1.77|            0.76|
|    min|              4.3|             2.0|              1.0|             0.1|
|    25%|              5.1|             2.8|              1.6|             0.3|
|    50%|              5.8|             3.0|              4.3|             1.3|
|    75%|              6.4|             3.3|              5.1|             1.8|
|    max|              7.9|             4.4|              6.9|             2.5|
+-------+-----------------+----------------+-----------------+----------------+
The PySpark code first creates a list named numeric_columns. This list
includes all columns from the DataFrame except for the target and
target_names columns, which are excluded using a conditional statement.
The select() method is then employed to choose only the specified numeric
columns. Next, the summary() method is applied to generate the summary
statistics, including count, mean, standard deviation, minimum, the
quartiles, and maximum.
We can observe that the outputs from Pandas and PySpark are nearly
identical, differing only in two cosmetic aspects. In PySpark, the statistics
column is labeled as summary, whereas in Pandas, it remains blank.
Additionally, PySpark labels the standard deviation as stddev, while Pandas
uses std.
[In]: spark_df.select(numeric_columns).describe().show()
[Out]:
+-------+-----------------+----------------+-----------------+----------------+
|summary|sepal length (cm)|sepal width (cm)|petal length (cm)|petal width (cm)|
+-------+-----------------+----------------+-----------------+----------------+
|  count|              150|             150|              150|             150|
|   mean|             5.84|            3.06|             3.76|            1.20|
| stddev|             0.83|            0.44|             1.77|            0.76|
|    min|              4.3|             2.0|              1.0|             0.1|
|    max|              7.9|             4.4|              6.9|             2.5|
+-------+-----------------+----------------+-----------------+----------------+
The describe() method may come in handy when the user doesn’t need the
quartiles.
To dig deeper into the dataset, we can check for null values in both
pandas_df and spark_df DataFrames.
Starting with Pandas, we can use the isnull() and sum() functions:
[In]: print(pandas_df.isnull().sum())
[Out]:
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
target_names         0
dtype: int64
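A sketch of the PySpark null-count check whose output is shown below; it follows the description given after the output (select(), isNull(), cast("integer"), sum(), and alias()):

from pyspark.sql.functions import col, sum as spark_sum

null_counts = spark_df.select(
    [spark_sum(col(c).isNull().cast("integer")).alias(c)
     for c in spark_df.columns])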
[In]: null_counts.show()
[Out]:
+-----------------+----------------+-----------------+----------------+------+------------+
|sepal length (cm)|sepal width (cm)|petal length (cm)|petal width (cm)|target|target_names|
+-----------------+----------------+-----------------+----------------+------+------------+
|                0|               0|                0|               0|     0|           0|
+-----------------+----------------+-----------------+----------------+------+------------+
The PySpark code begins by utilizing the select() method on the spark_df
DataFrame. Within this method, a list comprehension iterates through each
column in the DataFrame, accessed through the spark_df.columns attribute.
For each column, a series of functions is applied: the col() function
references the current column, followed by the isNull() function to
determine null values. To convert resulting Booleans into integers (1 for
True, 0 for False), cast("integer") is used. The integers are then summed up
using the sum() function. The alias(column) function assigns names to
columns. Finally, the show() method prints the results.
Both Pandas and PySpark outputs indicate that there are no missing records
within the Iris dataset.
In this section, we aim to build, train, and evaluate a decision tree classifier
using the Iris dataset. This contains four features (length and width of sepals
and petals) of 50 samples of three species of Iris (setosa, virginica, and
versicolor). Since the features are in the same unit of measurement (cm),
scaling is not required. Decision trees perform equally well with or without
scaling.
Using the Iris dataset, a decision tree classifier will make
predictions by following a set of rules based on the features of the input
data. It starts at the root node and evaluates a feature to determine which
branch to follow. For example, if the petal length is less than 2.5 cm, it may
follow the left branch, while if it is greater than or equal to 2.5 cm, it may
follow the right branch. This process continues until a leaf node is reached,
where the classifier assigns a class label based on the majority class of
training samples that reached that leaf node. This assigned class label serves
as the prediction for the given input.
PySpark’s distributed computing capability gives it the ability to process
much larger volumes of data and to increase the speed of training and
evaluation.
Before we begin the process of model building, let’s see how Scikit-Learn
and PySpark compare in terms of the modeling steps and the functions and
classes they use.
The following is a table comparing the classes and functions used in Scikit-
Learn and their equivalent counterparts in PySpark for model construction
in the context of multiclass decision tree classification:
Task                       Scikit-Learn              PySpark

Data preparation
  Splitting data           train_test_split          randomSplit
  Feature scaling          StandardScaler            StandardScaler
  Categorical encoding     OneHotEncoder             OneHotEncoder
  Imputing missing values  SimpleImputer             Imputer
  Feature vectorization    Not applicable            VectorAssembler

Model training
  Model class              DecisionTreeClassifier    DecisionTreeClassifier
  Training                 fit()                     fit()

Model evaluation
  Accuracy                 accuracy_score            MulticlassClassificationEvaluator
  Precision                precision_score
  Recall                   recall_score
  F1 score                 f1_score

Prediction
  Generate predictions     predict                   transform
These comparisons reveal that both Scikit-Learn and PySpark adhere to the
same modeling steps of data preparation, model training, model evaluation,
and prediction. In some cases, even the names of functions or classes are
the same in both platforms. These include StandardScaler, used for
standardizing features by removing the mean and scaling to unit variance;
OneHotEncoder, employed to convert categorical data into onehot encoded
features; DecisionTreeClassifier, which represents a decision tree classifier
model; and the fit() method, which is used to train the machine learning
model.
It’s worth noting that we are building models with default hyperparameters,
meaning that the results between Scikit-Learn and PySpark are likely to
differ. We demonstrate how to fine-tune the hyperparameters of a model in
Chapter 16.
In this step, we import the necessary libraries to load the Iris data, split the
data into training and testing sets, build the decision tree classifier, and
calculate the accuracy score:
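A sketch of these imports:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score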
In this step, we define a function to load and return the Iris dataset features
(X) and labels (y).
iris = load_iris()
X = iris.data
y = iris.target
return X, y
In this step, we define a function to split data into training and testing sets.
[In]: def split_data(X, y, test_size=0.2, random_state=42):
          return train_test_split(X, y,
                                  test_size=test_size, random_state=random_state)
[In]: def build_decision_tree(X, y):
          clf = DecisionTreeClassifier()
          clf.fit(X, y)
          return clf
The code in this step defines a function that evaluates and returns the
accuracy score of the decision tree classifier (clf) on the test features
(X_test) and labels (y_test).
[In]: def evaluate_model(clf, X_test, y_test):
          y_pred = clf.predict(X_test)
          accuracy = accuracy_score(y_test, y_pred)
          return accuracy
[In]: X, y = load_iris_dataset()
6.2. Split the dataset into training and testing sets using the split_data
function
6.3. Build the decision tree model using the training data
6.4. Evaluate the model on the test data
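A sketch of substeps 6.1 through 6.4, using the functions defined above:

X, y = load_iris_dataset()
X_train, X_test, y_train, y_test = split_data(X, y)
clf = build_decision_tree(X_train, y_train)
accuracy = evaluate_model(clf, X_test, y_test)
print(f'{accuracy:.2f}')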
[Out]: 1.00
In this step, the code imports several modules and classes to prepare the
data, build and train the model, evaluate its performance, and make
predictions on test data.
The code begins by importing the load_iris function to load the Iris dataset.
In this step, the code defines a function to load and return the Iris dataset.
Inside the function, the code uses the load_iris function to load the Iris
dataset, including both the features and target labels. It then assigns the
feature data and target labels of the Iris dataset to the variables X and y,
respectively. The function finally returns the feature matrix X and the target
array y as a tuple.
In this step, the code defines a function named split_data to split the dataset
into training and testing sets. It takes the following arguments:
• test_size: The proportion of the dataset to include in the test split (default
is 0.2, or 20%).
The test_size argument determines the proportion of the data that should be
allocated for testing, in this case, 0.2 or 20%. The random_state argument
sets the seed for random shuffling of the data during the split.
Finally, the function returns the four sets (feature data and labels for
training and testing) as a tuple.
The next line trains the decision tree classifier on the provided feature
matrix X
and target labels y. The fit method learns the patterns and relationships in
the data, enabling the classifier to make predictions based on the learned
knowledge.
Finally, the function returns the trained decision tree classifier (clf) as the
output of the function. This allows us to use the trained classifier for
making predictions on new, unseen data.
Having trained the model, we can check which features are driving the
model’s predictions. The raw importance scores are available through the
feature_importances_ attribute:
[In]: clf.feature_importances_
However, we need to add the feature names and sort the importances by
descending order. The code begins by defining a list feature_names, which
contains the names of the four features in the Iris dataset. It then extracts the
feature importances calculated by the trained decision tree classifier clf. The
attribute feature_importances_ provides importance scores assigned to each
feature by the classifier.
Moving to the next step, the code combines the feature_names list and the
feature_
importances array into a list of tuples using the zip function. The code then
sorts the list of tuples (importances_with_names) in descending order using
the sorted function with the reverse=True parameter based on the
importance scores (where x[1] refers to the second element of each tuple,
i.e., the importance score).
The following step involves a loop that iterates through the sorted list of
tuples, importances_with_names_sorted, and unpacks each tuple into the
variables name and importance. Within the loop, a line prints the name of a
feature (name) along with its corresponding importance score (importance).
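Putting the description above together, the listing looks roughly as follows (a sketch; the variable names follow the prose):

feature_names = ['sepal length', 'sepal width', 'petal length', 'petal width']
feature_importances = clf.feature_importances_
importances_with_names = list(zip(feature_names, feature_importances))
importances_with_names_sorted = sorted(importances_with_names,
                                       key=lambda x: x[1], reverse=True)
for name, importance in importances_with_names_sorted:
    print(f"{name}: {importance:.4f}")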
We can also take a look at a sample comparison between the actual and
predicted values:
[In]: print(comparison_df)
[Out]:
Actual
Predicted
1
The output indicates that the model has correctly classified each species, as
all five predicted values match the actual label values. This results in an
accuracy of 100%.
The next line calculates the accuracy of the predicted labels (y_pred) when
compared to the true labels (y_test) using the accuracy_score function,
which computes the fraction of correctly classified samples.
Finally, the function returns the calculated accuracy as the output of the
function.
The accuracy score indicates how well the trained classifier performs on the
test data.
The dataset is then split into training and testing sets using the split_data
function, with the default test size of 20%. A decision tree classifier is built
using the training data with the build_decision_tree function. The
classifier’s accuracy is evaluated on the test data using the evaluate_model
function, which calculates the accuracy using accuracy_score. Finally, the
accuracy of the model on the test data is printed, yielding
an accuracy of 1.00 (or 100%), indicating a perfect fit for this specific
execution.
This step has two substeps: loading the dataset and converting it to
PySpark.
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
pandas_df['label'] = y
return pandas_df
SparkSession (spark)
return spark.createDataFrame(pandas_df)
assembler = VectorAssembler(inputCols=spark_df.columns[:-1],
outputCol="features")
transformed_data = assembler.transform(spark_df)
return transformed_data
train_data, test_data =
transformed_data.randomSplit([train_ratio, 1 - train_ratio],
seed=seed)
In this step, we build and train a decision tree classifier using the provided
training data (Spark DataFrame).
dt = DecisionTreeClassifier(labelCol="label",
featuresCol="features")
model = dt.fit(train_data)
return model
predictions = model.transform(test_data)
evaluator =
MulticlassClassificationEvaluator(labelCol="label",
predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
In this step, we use the functions we defined previously. We begin with the
creation of a Spark Session.
getOrCreate()
[Out]: 0.94
In this step, we import the required libraries and modules that will be used
throughout the code. These libraries include Pandas, PySpark for distributed
data processing, and specific modules for machine learning tasks such as
creating feature vectors and training a decision tree classifier. We also
import the load_iris function from Scikit-Learn to load the Iris dataset.
SparkSession (spark)
One important feature of decision trees is that once the model is trained, we
can extract the feature importances to identify the features that drive the
predictions and those that are redundant. We can use the featureImportances
attribute to achieve this task.
In the following code, we add a few more lines to this attribute to append
the feature names to the feature importances and sort them in descending
order:
[Out]:
petal length (cm): 0.52
The output indicates that the petal length and width of the flowers are the
driving force behind the model’s predictions. In the PySpark model, the
sepal length and width are redundant, as both have an importance score of 0.
It’s worth noting that the accuracy of the PySpark model is lower than that
of Scikit-Learn, which was 100%. This difference arises because we are
building models with default hyperparameters, and the two frameworks
have distinct default settings. We will explore how to customize algorithm
hyperparameters in Chapter 16.
Scikit-Learn
"""
Load and return the Iris dataset features (X) and labels
(y).
Returns:
iris = load_iris()
X = iris.data
y = iris.target
return X, y
"""
Args:
set (default=0.2).
default=42).
Returns:
test_size=test_size, random_state=random_state)
"""
Args:
Returns:
classifier.
"""
clf = DecisionTreeClassifier()
clf.fit(X, y)
return clf
(y_test).
Args:
classifier.
Returns:
test data.
"""
y_pred = clf.predict(X_test)
return accuracy
[In]: X, y = load_iris_dataset()
PySpark
"""
Returns:
"""
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
pandas_df['label'] = y
return pandas_df
"""
provided SparkSession.
Args:
"""
return spark.createDataFrame(pandas_df)
"""
Args:
spark_df (pyspark.sql.DataFrame): The input Spark DataFrame.
Returns:
vector column.
"""
assembler = VectorAssembler(inputCols=spark_df.columns[:-1],
outputCol="features")
transformed_data = assembler.transform(spark_df)
return transformed_data
"""
Args:
Default is 42.
Returns:
Tuple[pyspark.sql.DataFrame, pyspark.sql.DataFrame]: A tuple
"""
train_data, test_data =
transformed_data.randomSplit([train_ratio, 1 - train_ratio],
"""
data.
Args:
Spark DataFrame.
Returns:
pyspark.ml.classification.DecisionTreeClassificationModel:
"""
dt = DecisionTreeClassifier(labelCol="label",
featuresCol="features")
model = dt.fit(train_data)
return model
"""
test data.
Args:
model
(pyspark.ml.classification.DecisionTreeClassificationModel):
DataFrame.
Returns:
"""
predictions = model.transform(test_data)
evaluator =
MulticlassClassificationEvaluator(labelCol="label",
predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
getOrCreate()
Summary
CHAPTER 9
Random Forest
When making predictions for a new Iris flower, each decision tree in the
random forest independently predicts the species based on the features of
the flower. The final prediction is determined through a majority vote or
averaging of the individual predictions from all the decision trees in the
forest.
Both Scikit-Learn and PySpark provide tools for building, training, and
evaluating random forest algorithms. To start, let’s compare these tools
before we create models in each of the two libraries.
Here is a table that illustrates the classes and functions employed in Scikit-
Learn alongside their corresponding equivalents in PySpark for
constructing random forest classifiers:
Task                       Scikit-Learn              PySpark

Data preparation
  Splitting data           train_test_split          randomSplit
  Feature scaling          StandardScaler            StandardScaler
  Categorical encoding     OneHotEncoder             OneHotEncoder
  Imputing missing values  SimpleImputer             Imputer
  Feature vectorization    Not applicable            VectorAssembler

Model training
  Model class              RandomForestClassifier    RandomForestClassifier
  Training                 fit()                     fit()

Model evaluation
  Accuracy                 accuracy_score            MulticlassClassificationEvaluator
  Precision                precision_score
  Recall                   recall_score
  F1 score                 f1_score

Prediction
  Generate predictions     predict                   transform
These comparisons show that both Scikit-Learn and PySpark follow the
same modeling steps of data preparation, model training, model evaluation,
and prediction.
The following code imports the necessary functions and classes, including
the load_
iris function to load the Iris dataset, the train_test_split function to split data
into training and testing sets, the RandomForestClassifier class to build the
random forest model, and the accuracy_score function to evaluate the
accuracy of predictions:
This function loads and returns the features (X) and labels (y) of the Iris
dataset. It uses the load_iris() function from Scikit-Learn to obtain the
dataset and assigns the features to X and the labels to y:
iris = load_iris()
X = iris.data
y = iris.target
return X, y
Step 3. The build_random_forest() function
This function builds and returns a random forest classifier using the features
(X) and labels (y). It initializes a RandomForestClassifier object, assigns it
to rf_classifier, and fits the classifier to the training data using the fit()
method:
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X, y)
return rf_classifier
This function evaluates and returns the accuracy score of the random forest
classifier (rf_classifier) on the test features (X_test) and labels (y_test). It
predicts the labels for the test data using the predict() method of the
classifier, compares the predicted labels with the true labels, and calculates
the accuracy score using the accuracy_score() function:
y_pred = rf_classifier.predict(X_test)
return accuracy
We can now call the preceding functions using the actual Iris dataset,
splitting it into training and testing sets with test_size=0.2 and
random_state=42:
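A sketch of these calls; evaluate_model is an assumed name for the evaluation function defined in the previous step:

X, y = load_iris_dataset()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)
rf_classifier = build_random_forest(X_train, y_train)
accuracy = evaluate_model(rf_classifier, X_test, y_test)
print(f'{accuracy:.2f}')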
[Out]: 1.00
The output shows that the model has accurately predicted every case,
yielding an accuracy rate of 100%. A comparison between the actual and
predicted values for the first five rows demonstrates this perfect match
between the predicted and actual values:
[In]: print("Predicted\tActual")
      for i in range(5):
          print(f"{y_pred[i]}\t\t{y_test[i]}")
[Out]:
Predicted Actual
1
feature_importances = rf_classifier.feature_importances_
feature_names = load_iris().feature_names
sorted_feature_importance_tuples = sorted(feature_importance_tuples,
print(f"{feature_name}: {importance}")
[Out]:
return spark.createDataFrame(pandas_df)
rf = RandomForestClassifier(labelCol="label",
featuresCol="features")
model = rf.fit(data)
return model
predictions = model.transform(data)
evaluator =
MulticlassClassificationEvaluator(labelCol="label",
predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
We are now in a position to call the previous functions with the aim of
obtaining the accuracy metric to assess the model’s performance:
[In]: X = iris.data
[In]: y = iris.target
[In]: df['label'] = y
getOrCreate()
outputCol="features")
results = predictions.select("label",
"prediction").limit(n_rows).toPandas()
print(results)
[Out]:
Label
Prediction
0
0
[In]: print(accuracy)
[Out]: 0.97
The output indicates that taking the entire sample of data points, the random
forest model achieves a good accuracy of approximately 97%. As a point of
reference, in the previous chapter, the decision tree classifier achieved a
lower accuracy of 94%.
Finally, we can determine the features that have contributed the most to
achieving this accuracy by utilizing the featureImportances attribute in
PySpark.
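A sketch of reading the importances from the trained PySpark model, assuming the model variable and the four Iris feature columns used above:

feature_cols = ['sepal length (cm)', 'sepal width (cm)',
                'petal length (cm)', 'petal width (cm)']
scores = model.featureImportances.toArray()
for name, score in sorted(zip(feature_cols, scores),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.2f}")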
Scikit-Learn
"""
Returns:
dataset.
"""
iris = load_iris()
X = iris.data
y = iris.target
return X, y
[In]: def build_random_forest(X, y):
          """
          Returns:
              A trained Random Forest classifier.
          """
          rf_classifier = RandomForestClassifier()
          rf_classifier.fit(X, y)
          return rf_classifier
"""
Args:
rf_classifier (RandomForestClassifier): A trained Random
Forest classifier.
Returns:
data.
"""
y_pred = rf_classifier.predict(X_test)
return accuracy
[In]: X, y = load_iris_dataset()
[In]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[In]: rf_classifier = build_random_forest(X_train, y_train)
[In]: y_pred = rf_classifier.predict(X_test)
[In]: print("Predicted\tActual")
      for i in range(5):
          print(f"{y_pred[i]}\t\t{y_test[i]}")
PySpark
"""
Parameters:
pandas_df (pd.DataFrame): The pandas DataFrame to be
converted.
a Spark DataFrame.
Returns:
pandas DataFrame.
"""
return spark.createDataFrame(pandas_df)
"""
Parameters:
training data.
Returns:
Forest model.
"""
rf = RandomForestClassifier(labelCol="label",
featuresCol="features")
model = rf.fit(data)
return model
"""
Parameters:
to be evaluated.
Returns:
"""
predictions = model.transform(data)
evaluator =
MulticlassClassificationEvaluator(labelCol="label",
predictionCol="prediction", metricName="accuracy")
"""
Parameters:
5.
Returns:
None
"""
results = predictions.select("label",
"prediction").limit(n_rows).toPandas()
print(results)
# Load the Iris dataset
[In]: iris = load_iris()
[In]: X = iris.data
[In]: y = iris.target
[In]: df = pd.DataFrame(X, columns=iris.feature_names)
[In]: df['label'] = y
# Create a SparkSession
[In]: spark = SparkSession.builder.getOrCreate()
Summary
CHAPTER 10
Support Vector Machine Classification
with Pandas, Scikit-Learn, and PySpark
The hyperplane acts as a decision boundary, with one class on each side.
The margin represents the perpendicular distance between the hyperplane
and the closest points of each class. A larger margin indicates a better
separation, while a smaller margin suggests a less optimal decision
boundary.
However, SVMs heavily rely on the choice of the kernel function and its
associated parameters. Selecting an inappropriate kernel or misconfiguring
the parameters can result in suboptimal outcomes. Additionally, training an SVM can be computationally expensive, particularly with large datasets.
In this chapter, we build, train, and evaluate an SVM classifier and use it to
predict whether a tumor is malignant or benign. This is based on the Breast
Cancer Wisconsin (Diagnostic) dataset, which is widely used for exploring
and evaluating classification algorithms. It offers a real-world scenario
where machine learning techniques can aid in distinguishing between
benign and malignant tumors based on measurable characteristics.
URL: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+
diagnostic
Date: 1995
We can now explore the dataset. First, we can use the Pandas shape
attribute to retrieve the dimensions (number of rows and columns) of the
DataFrame:
[In]: pandas_df.shape
[Out]: (569, 31)
The dataset has 569 rows and 31 columns (30 features plus the target
variable).
We can use the Pandas columns attribute to access the column labels or
names of the DataFrame:
[In]: print(pandas_df.columns)
[Out]:
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area', ...,
       'target'], dtype='object')
[In]: selected_columns = ['mean radius', 'mean texture', 'mean perimeter', 'mean area']
[In]: print(pandas_df[selected_columns].head())
[Out]:
   mean radius  mean texture  mean perimeter  mean area
0        17.99         10.38          122.80     1001.0
1        20.57         17.77          132.90     1326.0
2        19.69         21.25          130.00     1203.0
3        11.42         20.38           77.58      386.1
4        20.29         14.34          135.10     1297.0
We can also output an array containing the unique values of the target
column:
[In]: pandas_df['target'].unique()
[Out]: [0 1]
The output indicates that the target variable is either 0 (tumor is benign) or
1 (tumor is malignant or cancerous).
PySpark does not have a built-in shape attribute like Pandas to retrieve the
dimensions (number of rows and columns) of the spark_df. Therefore, we
need to perform two operations: count() and len() to return the number of
rows and columns, respectively:
[In]: spark_df_shape = (spark_df.count(), len(spark_df.columns))
[In]: print(spark_df_shape)
[Out]: (569, 31)
[In]: spark_df.select(selected_columns).show(5)
[Out]:
+-----------+------------+--------------+---------+
|mean radius|mean texture|mean perimeter|mean area|
+-----------+------------+--------------+---------+
|      17.99|       10.38|         122.8|   1001.0|
|      20.57|       17.77|         132.9|   1326.0|
|      19.69|       21.25|         130.0|   1203.0|
|      11.42|       20.38|         77.58|    386.1|
|      20.29|       14.34|         135.1|   1297.0|
+-----------+------------+--------------+---------+
Finally, we can use the distinct() function to get the unique values of the
target column:
[In]: spark_df.select('target').distinct().show()
[Out]:
+------+
|target|
+------+
|     0|
|     1|
+------+
There are two values for the target variable: 0 indicating a benign tumor and 1 indicating a malignant (cancerous) tumor.
The exploratory data analysis done in the previous section was helpful as it
helped us learn about the structure and characteristics of the data.
An SVM model with a linear kernel is trained on the scaled training data.
The trained model is used to make predictions on the test set. Various
evaluation metrics, including accuracy, precision, recall, F1 score, and area
under the receiver operating characteristic (ROC) curve, are calculated to
assess the performance of the model. Finally, the evaluation metrics are
printed.
These imports provide the necessary functions and tools to load the Breast
Cancer Wisconsin dataset, split it into training and testing sets, apply
feature scaling, train an SVM model, and evaluate its performance using
various metrics.
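A sketch of the imports this description implies (standard Scikit-Learn module paths; the book's exact list may differ slightly):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)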
[In]: data = load_breast_cancer()
[In]: X = data.data
[In]: y = data.target
This code loads the breast cancer dataset, separates the input features into
X, and assigns the corresponding target values to y.
[In]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In this code, the train_test_split() function is used to split the data into
training and testing datasets. This function takes the input features X and
the target values y as arguments and splits them into four separate datasets:
X_train, X_test, y_train, and y_test. The test_size parameter is set to 0.2,
which means that 20% of the data will be allocated for testing, and the
remaining 80% will be used for training the model. The random_state
parameter is set to 42. This parameter ensures reproducibility by fixing the
random seed. It means that each time you run the code, you will get the
same split of data into training and testing sets.
In this code, the StandardScaler() class from the Scikit-Learn library is used
to standardize the feature data. First, an instance of the StandardScaler class
is created and assigned to the variable scaler. Next, the fit_transform()
method is called on the scaler object with X_train as the argument. This
method computes the mean and standard deviation of each feature in the
X_train dataset and then applies the transformation to standardize the data.
The standardized feature data is assigned to the variable X_train_scaled.
Next, the fit() method is called on the svm object, with the standardized
training data (X_train_scaled) and the corresponding target values (y_train)
as arguments. This method trains the SVM classifier on the provided data.
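A minimal sketch of the scaling and training steps just described (scaling the test set with transform() is an added assumption that mirrors standard practice):

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

svm = SVC(kernel='linear')
svm.fit(X_train_scaled, y_train)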
The accuracy_score() function compares the predicted labels with the true labels (y_test) and returns the fraction of correctly predicted instances.
The ROC curve captures the trade-off between the true positive rate and the false positive rate. A higher AUC indicates a better ability to separate the two classes.
[In]: print(cm)
[Out]:
Accuracy: 0.96
Precision: 0.97
Recall: 0.96
F1 Score: 0.96
Confusion Matrix:
[[41 2]
[ 3 68]]
The interpretation of this output is as follows:
• AUC: The area under the ROC curve measures how well the model separates the positive and negative classes. The higher the AUC, the better the model distinguishes malignant from benign tumors.
• Precision: With a precision score of 0.97, the SVM classifier has a high proportion of true positive predictions (correctly predicted positive cases) among all positive predictions.
• Recall: The recall score of 0.96 indicates that the SVM classifier identifies about 96% of the actual positive cases.
The confusion matrix is printed in the following format: the first row contains the true negative and false positive counts, and the second row contains the false negative and true positive counts.
We can now write PySpark code to build, train, and evaluate an SVM classifier, similar to the one we constructed with Scikit-Learn using the same cancer dataset. Just as we did in Scikit-Learn, the model is constructed with the default hyperparameters.
Since the two platforms use different default hyperparameters, the results
between them may vary. In Chapter 16, we demonstrate how to fine-tune
the hyperparameters of an algorithm.
[In]: X = data.data.tolist()
[In]: y = data.target.tolist()
These two lines of code transform the feature data and target labels from the
breast cancer dataset into a suitable format (X and y) that can be used for
training the SVM
classifier.
zip(X, y) combines the feature data X and target labels y into a list of
tuples, where each tuple contains a feature vector and its corresponding
label. The createDataFrame() method creates a Spark DataFrame (spark_df)
by taking two arguments: the zipped list of tuples and the column names
[“features”, “label”].
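A sketch of the conversion described above, assuming the SparkSession has already been created and that each feature row is wrapped in a dense vector so that Spark ML can consume it:

from pyspark.ml.linalg import Vectors

rows = [(Vectors.dense(features), int(label)) for features, label in zip(X, y)]
spark_df = spark.createDataFrame(rows, ["features", "label"])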
outputCol="scaledFeatures")
The StandardScaler scales the features of the cancer dataset. This process
helps to normalize the features and bring them to a similar scale. The
inputCol=”features”
specifies the name of the input column containing the features in the
DataFrame, while the outputCol=”scaledFeatures” specifies the name of the
output column that will contain the scaled features in the transformed
DataFrame. The transformed DataFrame will have a new column named
“scaledFeatures” that will store the scaled values.
The LinearSVC class implements a linear support vector machine for binary classification. It seeks to find the hyperplane that best separates the data points of different classes by maximizing the margin between the classes. svm is an instance of the LinearSVC classifier.
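A sketch of the classifier setup and training described here and in the next paragraph (the column names and the train_data_scaled variable are assumptions consistent with the surrounding text):

from pyspark.ml.classification import LinearSVC

svm = LinearSVC(featuresCol="scaledFeatures", labelCol="label")
svm_model = svm.fit(train_data_scaled)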
The next line of code, svm_model = svm.fit(train_data_scaled), trains the
LinearSVC model using the training data that has been scaled using the
StandardScaler.
The trained model is returned and assigned to the svm_model variable, which can be used later for making predictions on new, unseen data.
The next code generates predictions for the test data using the trained SVM model (svm_model): the transform() method adds a prediction column containing the labels produced by the model. The area under the ROC curve can then be computed with the BinaryClassificationEvaluator.
count() counts the number of true positive (tp) predictions. It filters the predictions DataFrame to select the rows where the true label is 1 and the predicted label is also 1, and then counts the number of such rows.
count() counts the number of false positive (fp) predictions. It filters the
predictions DataFrame to select the rows where the true label
is 0 and the predicted label is 1, and then counts the number of
such rows.
count() counts the number of true negative (tn) predictions. It filters the
predictions DataFrame to select the rows where the true label is
0 and the predicted label is also 0, and then counts the number of
such rows.
count() counts the number of false negative (fn) predictions. It filters the predictions DataFrame to select the rows where the true label is 1 and the predicted label is 0, and then counts the number of such rows.
The accuracy calculation divides the sum of true positives and true negatives by the total number of predictions, giving the overall accuracy of the model. The precision calculation divides the true positives by the sum of true positives and false positives. Finally, the code assembles the confusion matrix as a list of lists. The first list represents the true negative (tn) and false positive (fp) counts, and the second list represents the false negative (fn) and true positive (tp) counts.
These evaluation metrics and the confusion matrix help assess the
performance of the SVM model by quantifying its accuracy, precision,
recall, and F1 score, as well as providing a breakdown of the predictions
into different categories based on true and predicted labels.
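The counting and metric logic described above can be sketched as follows (the column names follow the text; the book's exact code may differ):

from pyspark.sql.functions import col

tp = predictions.filter((col("label") == 1) & (col("prediction") == 1)).count()
fp = predictions.filter((col("label") == 0) & (col("prediction") == 1)).count()
tn = predictions.filter((col("label") == 0) & (col("prediction") == 0)).count()
fn = predictions.filter((col("label") == 1) & (col("prediction") == 0)).count()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
confusion_matrix = [[tn, fp], [fn, tp]]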
[In]: for row in confusion_matrix:
          print(row)
[Out]:
Accuracy: 0.9603
Precision: 0.9605
Recall: 0.9733
F1 Score: 0.9669
Confusion Matrix:
[48, 3]
[2, 73]
• Area Under ROC (0.99): The area under the receiver operating characteristic curve is very close to 1, suggesting that the model has an excellent ability to distinguish between the two classes.
• Accuracy (0.96): The model is able to correctly predict the class for approximately 96% of the test instances.
• Recall (0.97): The recall value of 0.97 indicates that the model identifies about 97% of the actual positive cases, producing few false negatives.
The confusion matrix breaks the predictions down into true positives, false positives, true negatives, and false negatives. The confusion matrix of the model is as follows:
[48, 3]
[2, 73]
The top-left value (48) represents the number of true negatives (TN),
meaning the instances correctly predicted as negative. The top-right value
(3) represents the number of false positives (FP), meaning the instances
incorrectly predicted as positive.
The bottom-left value (2) represents the number of false negatives (FN),
meaning the instances incorrectly predicted as negative. The bottom-right
value (73) represents the number of true positives (TP), meaning the
instances correctly predicted as positive.
[In]: data = load_breast_cancer()
[In]: X = data.data
[In]: y = data.target
[In]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[In]: print(cm)
PySpark
[In]: y = data.target.tolist()
[In]: scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
[In]: for row in confusion_matrix:
          print(row)
Summary
This chapter introduced support vector machines (SVMs) using the Breast
Cancer dataset. It used Pandas, Scikit-Learn, and PySpark for data
processing, exploration, and machine learning. The chapter discussed the
advantages and disadvantages of SVMs, as well as the kernel trick for
handling nonlinearly separable data. We constructed, trained, and evaluated
the linear SVM classifier, using it to predict whether a tumor is malignant
or benign. This was fitting since SVMs excel in binary classification, and
the breast cancer data consists of binary labels: 0 (tumor is benign) and 1
(tumor is cancerous). The model achieved a high accuracy rate.
CHAPTER 11
Naive Bayes Classification with Pandas, Scikit-Learn, and PySpark
In this chapter, we explore the concept of Naive Bayes and its application in
multiclass classification tasks. The algorithm estimates the probability
distribution of the features for each label and employs Bayes’ theorem to
determine the probability of each label given the features. To implement
and evaluate the Naive Bayes classifier, we utilize the Scikit-Learn and
PySpark libraries. The analysis will highlight the similarities between the
two libraries, enabling a seamless transition from Scikit-Learn to PySpark.
The Dataset
[In]: X, y = make_classification(
          n_samples=1000,
          n_features=3,
          n_informative=3,
          n_redundant=0,
          n_clusters_per_class=1,
          n_classes=3,
          weights=[0.3, 0.3, 0.4],
          flip_y=0.1,
          random_state=42
      )
[In]: X = np.abs(X)
[In]: pandas_df = pd.DataFrame(X, columns=['Feature 1', 'Feature 2', 'Feature 3'])
[In]: pandas_df['label'] = y
The following points explain the key arguments:
• n_classes=3: Specifies that the target variable has three classes for classification.
• weights=[0.3, 0.3, 0.4]: Specifies the class weights. Class 1 and Class 2
have a weight of 0.3 each, and Class 3 has a weight of 0.4. This is done to
impact the balance of class distribution in the dataset.
The X = np.abs(X) line ensures that all feature values are non-negative by
taking the absolute value of the feature matrix X. This is to ensure that the
features are nonnegative, as Naive Bayes assumes non-negative inputs. The
last two lines create a Pandas DataFrame named pandas_df to organize the
dataset.
[In]: print(pandas_df.head())
[Out]:
   Feature 1  Feature 2  Feature 3  label
0       0.23       0.39       0.89
1       0.75       1.75       3.16
2       0.32       0.09       0.96
3       4.28       1.59       1.66
4       0.32       0.74       1.15
[In]: spark_df.show(5)
[Out]:
+---------+---------+---------+-----+
|Feature 1|Feature 2|Feature 3|label|
+---------+---------+---------+-----+
|     0.23|     0.39|     0.89|     |
|     0.75|     1.75|     3.16|     |
|     0.32|     0.09|     0.96|     |
|     4.28|     1.59|     1.66|     |
|     0.32|     0.74|     1.15|    0|
+---------+---------+---------+-----+
We can see that the top five rows generated by the Pandas head() and
PySpark show() methods are the same, with the only difference being that
Pandas generates an index while PySpark doesn’t.
[In]: print(pandas_df.describe())
[Out]:
       Feature 1  Feature 2  Feature 3  label
count    1000.00    1000.00    1000.00   1000
mean        1.13       1.12       1.07    1.1
std         0.79       0.72       0.72   0.83
min         0.00       0.01       0.00
25%         0.50       0.60       0.50
50%         0.98       1.03       0.99      1
75%         1.61       1.48       1.52
max         4.28       3.90       3.95
[In]: spark_df.summary().show()
[Out]:
+-------+---------+---------+---------+-----+
|summary|Feature 1|Feature 2|Feature 3|label|
+-------+---------+---------+---------+-----+
|  count|     1000|     1000|     1000| 1000|
|   mean|     1.13|     1.12|     1.07|  1.1|
| stddev|     0.79|     0.72|     0.72| 0.83|
|    min|     0.00|     0.01|     0.00|     |
|    25%|     0.50|     0.60|     0.50|     |
|    50%|     0.98|     1.03|     0.99|    1|
|    75%|     1.61|     1.48|     1.52|     |
|    max|     4.28|     3.90|     3.95|     |
+-------+---------+---------+---------+-----+
The output from both datasets is a summary statistics table with four
columns: Feature 1, Feature 2, Feature 3, and label. Each row in the table
provides specific statistics for each column, and the columns represent the
following information:
• count: This shows the number of non-null values in each column. In this case, the dataset contains 1,000 data points, and all columns are fully populated.
• mean: This shows the average value for each column. It represents the arithmetic mean of all the values in each column. For example, the mean of Feature 1 is approximately 1.13.
• min: This row shows the smallest value present in each column. For example, the minimum of Feature 1 is 0.00.
• 0.25, 0.50, 0.75: These rows represent the quartiles of the data in
each column. The 0.25 row corresponds to the first quartile (25th
percentile), the 0.50 row corresponds to the second quartile (median
or 50th percentile), and the 0.75 row corresponds to the third quartile (75th
percentile). For example, the first quartile for Feature 1 is
approximately 0.5, the median is 0.98, and the third quartile is 1.61.
• max: This row displays the maximum value in each column. It shows the largest value present in each column. For example, the maximum of Feature 1 is 4.28.
Let's now calculate the distribution of the target variable using the Pandas value_counts() method:
[In]: print(pandas_df["label"].value_counts())
[Out]:
0    302
1    296
2    402
[In]: spark_df.groupBy("label").count().show()
[Out]:
+-----+-----+
|label|count|
+-----+-----+
|    0|  302|
|    1|  296|
|    2|  402|
+-----+-----+
We can see from the output that summing up across the three classes adds up
to a total count of 1,000. Labels 0 and 1 each contribute about 30%, and
label 2’s share is about 40% of the total count. This distribution makes the
sample moderately balanced.
In this section, we build, train, and evaluate a Naive Bayes classifier using
Scikit-Learn and PySpark libraries. The code utilizes the classification
dataset we generated in the previous section, which is split into training and
testing sets. A Naive Bayes classifier is trained and then used to make
predictions on the test set. Various evaluation metrics including accuracy,
precision, recall, and F1 score are calculated to assess the performance of the
model.
[In]: import numpy as np
[In]: from sklearn.datasets import make_classification
[In]: from sklearn.model_selection import train_test_split
[In]: from sklearn.naive_bayes import GaussianNB
[In]: from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
In this step, we import the essential libraries required for our classification
task.
These include numpy to work with array data; train_test_split to split the
dataset into training and testing sets; GaussianNB, which is the Gaussian
Naive Bayes classifier we are using for classification; and accuracy_score,
precision_score, recall_score, and f1_score to assess the model’s
performance.
[In]: X, y = make_classification(
          n_samples=1000,
          n_features=3,
          n_informative=3,
          n_redundant=0,
          n_clusters_per_class=1,
          n_classes=3,
          weights=[0.3, 0.3, 0.4],
          flip_y=0.1,
          random_state=42
      )
[In]: X = np.abs(X)
[In]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In this step, we split the generated data into training and testing sets using
train_
test_split. This separation is crucial for training the model on 80% of the
data (X_train, y_train) and evaluating its performance on the remaining 20%
(X_test, y_test) to ensure it generalizes well. This 80-20 split helps us assess
the model’s ability to make accurate predictions on unseen data.
[In]: nb = GaussianNB()
[In]: nb.fit(X_train, y_train)
Here, we create an instance of the Gaussian Naive Bayes classifier (nb) and
fit it to the training data (X_train, y_train) using the fit() method.
In this step, we use the trained model (nb) to make predictions on the test
data (X_test) by employing the predict() method.
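A sketch of the prediction and evaluation steps (the weighted averaging strategy is an assumption; the text only indicates that an averaging strategy is used):

y_pred = nb.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')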
[Out]:
Accuracy: 0.44
Precision: 0.45
Recall: 0.44
F1-score: 0.43
The precision and recall values are computed with a multiclass averaging strategy. The accuracy of 0.44 means that the model correctly predicted 44% of the instances in the test set, and the recall score of 0.44 suggests that the model successfully identified 44% of all actual positive instances.
The following code builds, trains, and evaluates the Naive Bayes classifier
using PySpark: Step 1: Import necessary libraries
In this first step, the required libraries and modules are imported to set up the
environment for the machine learning tasks. These include SparkSession to
work with Spark’s DataFrame-based API, NaiveBayes to use the Naive
Bayes classifier from PySpark’s machine learning library, Vectors for
handling feature vectors in Spark, and make_classification from Scikit-Learn to generate the same synthetic dataset used earlier.
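A sketch of the imports this step describes (standard module paths; the book's exact list may differ):

from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.linalg import Vectors
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from sklearn.datasets import make_classification
import numpy as np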
[In]: X, y = make_classification(
          n_samples=1000,
          n_features=3,
          n_informative=3,
          n_redundant=0,
          n_clusters_per_class=1,
          n_classes=3,
          weights=[0.3, 0.3, 0.4],
          flip_y=0.1,
          random_state=42
      )
[In]: X = np.abs(X)
The dataset is split into training and testing sets using randomSplit. 80% of
the data is allocated for training (train_df), and 20% is reserved for testing
(test_df). The seed=42
ensures reproducibility.
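A sketch of the conversion and split described here (the appName comes from the consolidated code later in the chapter; wrapping each feature row with Vectors.dense is an assumption):

spark = SparkSession.builder.appName("NaiveBayesMultiClassExample").getOrCreate()
rows = [(Vectors.dense(features), int(label)) for features, label in zip(X, y)]
df = spark.createDataFrame(rows, ["features", "label"])
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)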
The featuresCol parameter specifies the input features column, the labelCol parameter specifies the target label column, and the modelType parameter is set to "multinomial", which requires the non-negative features prepared earlier.
Next, we fit the Naive Bayes model to our training data (represented by
train_df) using the fit() method. This process involves the model learning
from the training data to make predictions on new, unseen data.
The trained model (nb_model) is used to make predictions on the test data
(test_df) using the PySpark transform() method. This results in a DataFrame
called predictions.
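A sketch of the model definition, training, and prediction steps (the multinomial model type is an inference from the non-negativity requirement discussed earlier):

nb = NaiveBayes(featuresCol="features", labelCol="label", modelType="multinomial")
nb_model = nb.fit(train_df)
predictions = nb_model.transform(test_df)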
[In]: evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
[In]: accuracy = evaluator.evaluate(predictions)
[In]: evaluator_precision = MulticlassClassificationEvaluator(metricName="weightedPrecision")
[In]: precision = evaluator_precision.evaluate(predictions)
[In]: evaluator_recall = MulticlassClassificationEvaluator(metricName="weightedRecall")
[In]: recall = evaluator_recall.evaluate(predictions)
[In]: evaluator_f1 = MulticlassClassificationEvaluator(metricName="f1")
[In]: f1 = evaluator_f1.evaluate(predictions)
[Out]:
Accuracy: 0.39
Precision: 0.22
Recall: 0.39
F1-score: 0.23
For each metric, the evaluator is applied to the model’s predictions generated
on the test data. The results are then stored in their respective variables:
accuracy, precision, recall, and f1. These metrics provide insights into
different aspects of the model’s performance, such as its ability to correctly
classify instances, account for class imbalance (weighted metrics), and
balance precision and recall (F1 score).
• Accuracy (0.39): This tells us that the model correctly predicted 39%
• Precision (0.22): A precision of 0.22 indicates that only 22% of the positive
predictions made by the model were correct.
• F1 score (0.23): The F1 score is the harmonic mean of precision and recall. In this case, the F1 score is 0.23, which suggests that the model strikes a poor balance between precision and recall.
Scikit-Learn
[In]: X, y = make_classification(
          n_samples=1000,
          n_features=3,
          n_informative=3,
          n_redundant=0,
          n_clusters_per_class=1,
          n_classes=3,
          weights=[0.3, 0.3, 0.4],
          flip_y=0.1,
          random_state=42
      )
[In]: X = np.abs(X)
Step 3: Split the data into training and testing sets
[In]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[In]: nb = GaussianNB()
PySpark
[In]: spark = SparkSession.builder.appName("NaiveBayesMultiClassExample").getOrCreate()
Step 3: Generate the dataset with non-negative features and three classes
[In]: X, y = make_classification(
          n_samples=1000,
          n_features=3,
          n_informative=3,
          n_redundant=0,
          n_clusters_per_class=1,
          n_classes=3,
          weights=[0.3, 0.3, 0.4],
          flip_y=0.1,
          random_state=42
      )
[In]: X = np.abs(X)
[In]: evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
[In]: accuracy = evaluator.evaluate(predictions)
[In]: evaluator_f1 = MulticlassClassificationEvaluator(metricName="f1")
[In]: f1 = evaluator_f1.evaluate(predictions)
In the next chapter, we will introduce a new type of supervised learning: the
Multilayer Perceptron (MLP). This is a powerful approach to machine
learning widely used for classification tasks. MLP classifiers have shown
their effectiveness in solving classification problems across different fields
because they can capture intricate dependencies and nonlinear relationships
within the data.
CHAPTER 12
Neural Network Classification
with Pandas, Scikit-Learn, and PySpark
Insufficient data can lead to overfitting. Even with large datasets, without proper regularization and hyperparameter tuning, MLPs can easily overfit to the training data.
The Dataset
By importing load_digits, we gain access to the function that allows
us to load the handwritten digits dataset.
This line of code calls the load_digits() function and assigns the returned dataset to the variable named dataset:
[In]: dataset = load_digits()
Step 3: Create a Pandas DataFrame from the data, starting with the features
[In]: pandas_df = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
[In]: pandas_df['target'] = dataset.target
We can now start learning about the dataset, beginning with Pandas. The
shape attribute shows that the dataset has 1,797 rows and 65 columns (64
features plus the target variable):
[In]: print(pandas_df.shape)
[Out]: (1797, 65)
[In]: print(pandas_df.columns)
[Out]:
Index(['pixel_0_0', 'pixel_0_1', 'pixel_0_2', ..., 'pixel_7_7', 'target'],
      dtype='object')
This output represents the column names or labels for the features (pixels)
and the target variable. pixel_0_0, pixel_0_1, ..., pixel_7_7 are the column
names for the features (pixels) of the dataset. Each column represents a
specific pixel in the images of handwritten digits. The format pixel_x_y
indicates the location of the pixel within an image. Here, x represents the
row (0 to 7), and y represents the column (0 to 7) of the pixel. In other
words, for each row in the dataset, we will have values associated with these
columns corresponding to the pixel intensity at the given (x, y) position
within the digit image.
The target column is used to store the target labels or the digit labels
associated with each image in the dataset. In our handwritten digit
recognition task, this column contains the actual digit (0 through 9) that the
corresponding image represents. For example, if we have an image of the
digit 3, the value in the target column for that row would be 3.
We can print the first five rows and first five columns of pandas_df as follows:
[In]: print(pandas_df.iloc[:5, :5])
[Out]:
   pixel_0_0  pixel_0_1  pixel_0_2  pixel_0_3  pixel_0_4
(first five rows of pixel intensity values, each between 0 and 16)
In this output, each column value represents the grayscale intensity or
darkness of the pixel at the corresponding position within an image. The
intensity values range from 0 (completely white) to 16 (completely black).
We can confirm the min and max pixel values using the Pandas min() and max() methods. We can also output an array containing the unique values of the target column:
[In]: print(pandas_df['target'].unique())
[Out]: [0 1 2 3 4 5 6 7 8 9]
This output shows that there are ten digits ranging from 0 to 9.
We can display the first five images from the load_digits dataset using
Matplotlib and Scikit-Learn using the following steps:
Step 1: Import Matplotlib
[In]: import matplotlib.pyplot as plt
Step 2: Load the digits dataset
[In]: dataset = load_digits()
Step 3: Display the first five images
[In]: fig, axes = plt.subplots(1, 5)
      for i, ax in enumerate(axes):
          ax.imshow(dataset.images[i], cmap='gray')
          ax.set_title(f"Label: {dataset.target[i]}")
          ax.axis('off')
[In]: plt.tight_layout()
[In]: plt.show()
[Out]: (the first five digit images are displayed, each with its label shown above the image)
Finally, the following for loop iterates over the five subplots, assigning each subplot to the variable ax:
• ax.imshow(dataset.images[i], cmap='gray') draws the ith digit image in grayscale.
• ax.set_title(f"Label: {dataset.target[i]}") sets the subplot title to the corresponding label from the dataset.target array.
• ax.axis('off') turns off the axis ticks and labels for the subplot.
plt.tight_layout() adjusts the spacing between the subplots to prevent them from overlapping.
The code first imports the SparkSession, which is the entry point for using
Apache Spark’s functionality. It then creates a SparkSession object named
spark. The builder method is used to configure and create a Spark Session.
Calling getOrCreate() ensures that if a Spark Session already exists, it will
reuse that session. If not, it will create a new one. In the final step, the code
converts the Pandas DataFrame (pandas_df) into a PySpark DataFrame
(spark_df) using the createDataFrame() method provided by the
SparkSession.
This indicates that the dataset has 1,797 rows and 65 columns (64 features or
8×8
pixels and a target variable, which is the handwritten digit ranging from 0 to
9).
[In]: print(spark_df.columns)
[Out]:
Now, let’s print the first five rows and first five columns of the PySpark
DataFrame just like we did with Pandas:
[In]: spark_df.select(spark_df.columns[:5]).show(5)
[Out]:
+---------+---------+---------+---------+---------+
|pixel_0_0|pixel_0_1|pixel_0_2|pixel_0_3|pixel_0_4|
+---------+---------+---------+---------+---------+
+---------+---------+---------+---------+---------+
In the following steps, we show the distinct values of the target variable:
[In]: distinct_values = spark_df.select(spark_df.columns[-1]).distinct().
orderBy(spark_df.columns[-1])
[In]: distinct_values.show()
[Out]:
+-----+
|label|
+-----+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+-----+
To retrieve the distinct values from the target column as shown previously,
we first selected the last column of the DataFrame using
spark_df.columns[-1]. Then, we applied the distinct() method to retrieve a
new DataFrame containing only the unique values from the selected column.
We used the orderBy() method to sort the distinct values in ascending order
based on the last column. The resulting DataFrame, containing the distinct
values in ascending order, is stored in the variable distinct_values. Finally,
the show() method is called on the distinct_values DataFrame.
When making predictions using the MLP model for the handwritten digit recognition task, the data flows through the network as follows:
• Input layer: The 8×8 pixel representation of a digit image is flattened into a 1D vector and serves as the input to the MLP model. Each pixel value becomes one element of the input data.
• Hidden layers: The input is passed through the hidden layers, where each neuron computes a weighted sum of its inputs and applies a nonlinear activation function.
• Output layer: The final hidden layer is connected to the output layer, which produces the predictions for each digit class. The number of output neurons equals the number of classes (in our case, digits 0–9). Each output neuron represents the model's score for a particular class.
• Prediction: The digit class with the highest probability or confidence score
in the output layer is selected as the predicted class for the
input digit.
During the training process, the MLP model adjusts the weights of the
connections between neurons using backpropagation and optimization
algorithms such as gradient descent. This iterative process aims to minimize
the difference between the predicted outputs and the true labels in the
training data, improving the model’s ability to make accurate predictions. By
learning from a large number of labeled examples, the MLP model develops
an understanding of the underlying patterns and features that distinguish
different handwritten digits. This enables it to generalize and make
predictions on unseen digit images with reasonable accuracy.
• Split the data: Split the dataset into training and testing sets. This division
allows us to train the MLP classifier on a portion of the data and evaluate its
performance on unseen data. Typically, around 80%
of the data is used for training, while the remaining 20% is used for
testing.
• Create and train the MLP classifier: Initialize an MLP classifier, and configure the desired number of hidden layers and neurons in each layer. Train the MLP classifier using the training data, where the model learns to map pixel values to digit labels.
• Predict the digits: Apply the trained MLP classifier to the testing set to
make predictions. The model will take the input features (pixel values
of handwritten digits) and assign a predicted label to each sample.
• Evaluate the model: Compare the predicted labels with the true labels of the testing set to assess the performance of the MLP classifier.
The fact that both Scikit-Learn and PySpark follow similar guidelines for
deep learning implementation will assist data scientists who aim to migrate
from Scikit-Learn to PySpark. This migration will enable the deep learning
practitioners to leverage PySpark’s powerful distributed computing API in
significant ways. While deep learning frameworks like TensorFlow, Keras,
and PyTorch are commonly used for training deep neural networks, Spark
can complement these frameworks in several key areas:
We first demonstrate the use of the Scikit-Learn library for training and
evaluating a neural network classifier on the digits dataset. We will follow
these steps: Step 1: Import the necessary libraries
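A sketch of the imports for this section (standard Scikit-Learn module paths):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)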
In this step, the digits dataset is loaded using load_digits(), and the data is assigned to the variable dataset:
[In]: dataset = load_digits()
The data is split into training and testing sets using train_test_split(). The
input features are stored in X_train and X_test, while the corresponding
target values are stored in y_train and y_test. The test_size parameter
specifies the proportion of the data to be allocated for testing (20%), and
random_state 42 ensures reproducibility of the split.
In this step, a neural network classifier is created using
MLPClassifier(), with the hidden_layer_sizes parameter specifying the
architecture of the neural network. There are two hidden layers, each with 64
neurons. The classifier is then trained on the training data using the fit()
method, with X_train as the input features and y_train as the target values.
The predictions are made on the test set using predict(), and the predicted
values are stored in y_pred.
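A sketch of the split, training, and prediction steps described above (max_iter is an added assumption to give the optimizer room to converge; it is not mentioned in the text):

X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42)

mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=42)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)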
[In]: print(confusion_mat)
More evaluation metrics (precision, recall, and F1 score) are then computed
using precision_score(), recall_score(), and f1_score(), respectively. These
metrics are calculated by comparing the predicted values with the true
values. The precision, recall, and F1 score values are printed using the
print() method.
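A sketch of the metric calculations described in this paragraph (the weighted averaging strategy is an assumption):

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
confusion_mat = confusion_matrix(y_test, y_pred)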
[Out]:
Accuracy: 0.98
Precision: 0.98
Recall: 0.98
F1 Score: 0.98
Confusion Matrix:
[[32 0 0 0 1 0 0 0 0 0]
[ 0 27 1 0 0 0 0 0 0 0]
[ 0 0 32 0 0 0 0 1 0 0]
[ 0 0 0 33 0 1 0 0 0 0]
[ 0 0 0 0 46 0 0 0 0 0]
[ 0 0 0 0 0 46 1 0 0 0]
[ 1 0 0 0 0 0 34 0 0 0]
[ 0 0 0 0 0 0 0 33 0 1]
[ 0 1 0 0 0 0 0 0 29 0]
[ 0 0 0 0 0 0 0 1 0 39]]
• Precision (0.98): Precision is calculated for each class and then averaged. The precision score of 0.98 indicates that, on average, about 98% of the instances assigned to a class actually belong to it.
• Recall (0.98): This metric, also known as sensitivity or true positive rate,
measures the classifier’s ability to identify positive instances
(that is, the proportion of actual positives it recovers while avoiding false negatives). Like precision, recall is calculated for each class and
then averaged. The recall score here is 0.98, indicating that, on average, the
classifier achieved a recall of approximately 98% across all classes.
• Confusion matrix: Each off-diagonal entry counts instances of one digit that were misclassified as another class. For example, one instance of digit 0 was predicted as 4, one instance of digit 2 was predicted as 7, and one instance of digit 9 was predicted as 7; each of the remaining errors likewise involves a single instance being assigned to another class.
In this first step, each of the imported libraries has its own functionality:
SparkSession is used to create a Spark Session, VectorAssembler is used to
assemble the input features into a vector column,
MultilayerPerceptronClassifier is the MLP classifier used for classification,
MulticlassClassificationEvaluator is used to evaluate the performance of the
classifier, pd is used to create a Pandas DataFrame, and load_digits is used to
load the handwritten digits dataset.
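A sketch of these imports (standard module paths):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
import pandas as pd
from sklearn.datasets import load_digits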
[In]: spark = SparkSession.builder.getOrCreate()
Step 3: Load the digits dataset
In this code, the first line creates a DataFrame called df using the
DataFrame() function from the Pandas library. The argument
data=dataset.data specifies the data for the DataFrame, which corresponds to
the pixel values of the digit images. The argument
columns=dataset.feature_names sets the column names of the DataFrame as
the feature names provided in the dataset object. These feature names
represent the pixel positions in the image.
This line of code creates a PySpark DataFrame called spark_df from the
existing Pandas DataFrame df using the createDataFrame() method. By
converting the Pandas DataFrame to a PySpark DataFrame, we can leverage
the distributed computing capabilities of Spark and apply scalable operations
on the data using Spark’s built-in functions and machine learning algorithms.
This is particularly useful when working with large datasets that may not fit
into memory on a single machine.
In this step, we split the data into training and testing sets:
This line of code splits the PySpark DataFrame spark_df into two separate
DataFrames, train and test, for training and testing the MLP classifier,
respectively. The randomSplit() method allows us to randomly partition a
DataFrame into multiple parts based on the given weights. The weights [0.8,
0.2] indicate that approximately 80% of the data will be assigned to the train
DataFrame, and the remaining 20% will be assigned to the test DataFrame.
The seed=42 parameter sets the random seed for reproducibility. By
providing the same seed value, we ensure that the data split remains
consistent across multiple runs of the code, which is useful for comparison
and debugging purposes.
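A minimal sketch of the split described above:

train, test = spark_df.randomSplit([0.8, 0.2], seed=42)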
outputCol="features")
316
Chapter 12 Neural Network ClassifiCatioN with paNdas, sCikit-learN, aNd
pyspark This code creates a Vector Assembler in three steps:
vector column.
VectorAssembler.
step extracts the names of the columns that represent the input
assembler = VectorAssembler(inputCols=input_features,
the input features from the specified columns and creates a new
Next, the architecture of the network is defined:
1. The input layer has 64 neurons, one for each pixel feature.
2. A list such as layers = [64, 64, 64, 10] specifies the sizes of the input layer, the two hidden layers, and the number of neurons in those layers.
3. The output layer has 10 neurons, one for each class in dataset.target_names.
[In]: classifier = MultilayerPerceptronClassifier(layers=layers, seed=42)
The layers argument passes the architecture of the MLP model, which was defined earlier. seed=42 sets the random seed for reproducibility.
In this line of code, the fit() method is called on the classifier object to train
the neural network model using the training data. The train_transformed is
the training data obtained after applying the VectorAssembler transformation
to the original training data. The fit() method uses this preprocessed training
data to optimize the weights and biases of the MLP through an iterative
process called backpropagation. During training, the model learns to map the
input features to the target labels by adjusting its internal parameters.
After the training process is completed, the fit() method returns a trained
model object, which is assigned to the model variable. This trained model
can then be used to make predictions on new, unseen data.
In this line of code, the transform() method is called on the trained model
object, passing the test_transformed DataFrame as the input. The code
generates predictions for the test data using the trained neural network
model.
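A sketch of the training and prediction calls described in these two paragraphs:

model = classifier.fit(train_transformed)
predictions = model.transform(test_transformed)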
[In]: evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
[In]: accuracy = evaluator.evaluate(predictions)
[In]: evaluator_precision = MulticlassClassificationEvaluator(metricName="weightedPrecision")
[In]: precision = evaluator_precision.evaluate(predictions)
[In]: evaluator_recall = MulticlassClassificationEvaluator(metricName="weightedRecall")
[In]: recall = evaluator_recall.evaluate(predictions)
[In]: evaluator_f1 = MulticlassClassificationEvaluator(metricName="f1")
[In]: f1 = evaluator_f1.evaluate(predictions)
[In]: confusion_matrix = predictions.groupBy("label").pivot("prediction").count().fillna(0).orderBy("label")
[In]: confusion_matrix.show(truncate=False)
[Out]:
Accuracy: 0.95
Precision: 0.96
Recall: 0.95
F1 Score: 0.95
+-----+---+---+---+---+---+---+---+---+---+---+
|label|0.0|1.0|2.0|3.0|4.0|5.0|6.0|7.0|8.0|9.0|
+-----+---+---+---+---+---+---+---+---+---+---+
|0 |32 |0 |0 |0 |1 |0 |0 |0 |0 |0 |
|1 |0 |30 |0 |0 |0 |0 |0 |0 |1 |0 |
|2 |0 |1 |40 |0 |0 |0 |0 |0 |0 |0 |
|3 |0 |0 |0 |36 |0 |0 |0 |0 |1 |1 |
|4 |0 |1 |0 |0 |36 |0 |0 |0 |0 |0 |
|5 |0 |0 |0 |0 |0 |35 |0 |0 |0 |0 |
|6 |1 |0 |0 |0 |2 |0 |37 |0 |1 |0 |
|7 |0 |0 |0 |0 |0 |0 |0 |33 |2 |1 |
|8 |0 |0 |0 |1 |0 |0 |0 |0 |28 |0 |
|9 |0 |0 |0 |1 |0 |0 |0 |1 |0 |30 |
+-----+---+---+---+---+---+---+---+---+---+---+
In this final step, several evaluation metrics are calculated, and a confusion
matrix is computed based on the predictions generated by the trained MLP
model. The following is the three-step process of creating the evaluators,
calculating the evaluation metrics, and printing the evaluation metrics:
the accuracy metric. The evaluator will be used to evaluate the accuracy of
the model’s predictions.
• evaluator_precision = MulticlassClassificationEvaluator(metricName="weightedPrecision") creates the evaluator for the weighted precision metric.
• evaluator_recall = MulticlassClassificationEvaluator(metricName="weightedRecall") creates the evaluator for the weighted recall metric.
• evaluator_f1 = MulticlassClassificationEvaluator(metricName="f1") creates the evaluator for the F1 score.
• confusion_matrix = predictions.groupBy("label").pivot("prediction").count().fillna(0).orderBy("label") groups the predictions by the true labels ("label") and pivots them against the predicted labels ("prediction"), producing a matrix with each true label as a row and each predicted label as a column.
Once the code in this step is executed, the evaluation metrics are displayed. The preceding output can be interpreted as follows:
• Accuracy (0.95): The model correctly classifies approximately 95% of the test instances in the dataset.
• The confusion matrix shows the counts of correct and incorrect predictions for each class. The MLP's results cover digits 0 to 9: the row headers represent the true labels, the column headers represent the predicted labels, and each cell in the matrix represents the count of instances where the true label received the corresponding predicted label. For example, in this confusion matrix:
• One instance of digit 0 was predicted as 4.
• One instance of digit 1 was predicted as 8.
• For digit 6, one instance was misclassified as 0, two as 4, and one as 8.
• One instance of digit 9 was misclassified as 3 and another was misclassified as 7.
In this section, we consolidate all the relevant code from the previous steps
into a single code block. This enables the reader to execute the code as a
cohesive unit.
Scikit-Learn
Step 1: Import required libraries
[In]: print(confusion_mat)
PySpark
[In]: spark = SparkSession.builder.getOrCreate()
# Create features
[In]: assembler = VectorAssembler(inputCols=input_features, outputCol="features")
# Calculate accuracy
[In]: evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
[In]: accuracy = evaluator.evaluate(predictions)
[In]: evaluator_precision = MulticlassClassificationEvaluator(metricName="weightedPrecision")
[In]: precision = evaluator_precision.evaluate(predictions)
[In]: evaluator_recall = MulticlassClassificationEvaluator(metricName="weightedRecall")
[In]: recall = evaluator_recall.evaluate(predictions)
[In]: evaluator_f1 = MulticlassClassificationEvaluator(metricName="f1")
[In]: f1 = evaluator_f1.evaluate(predictions)
[In]: confusion_matrix = predictions.groupBy("label").pivot("prediction").count().fillna(0).orderBy("label")
[In]: confusion_matrix.show(truncate=False)
Summary
In this chapter, we explored the world of Multilayer Perceptron (MLP)
classifiers, a powerful approach to supervised machine learning. We
examined the advantages and disadvantages of MLPs and concluded that
deep learning remains a versatile technique for both regression and
classification.
The chapter illustrated the usage of MLP classifiers, working with the
handwritten digits dataset. We implemented the classifiers using Scikit-
Learn and PySpark libraries and utilized Pandas and PySpark for data
processing and exploration. The trained model achieved very high accuracy,
leading to accurate predictions of handwritten digits. We highlighted that,
since Scikit-Learn and PySpark follow the same guidelines for deep
learning, transitioning to PySpark to harness its distributed capabilities was
worthwhile.
CHAPTER 13
Recommender Systems with Pandas, Surprise, and PySpark
There are three main approaches to building recommender systems:
• Content-based filtering recommends items similar to the ones a user has liked or interacted with in the past. It analyzes the characteristics or features of items and recommends items with similar attributes.
• Collaborative filtering relies on the preferences of similar users. It looks for patterns and similarities among users' behaviors and makes recommendations based on those patterns.
• Hybrid approaches find content that is similar to the ones a user has liked and recommend items by combining content-based and collaborative filtering techniques.
The Dataset
The dataset for this project is an open source dataset containing ratings of
various Amazon electronic products provided by UC San Diego's Computer Science Department.
URL: http://jmcauley.ucsd.edu/data/amazon/
Contributor: Julian McAuley
We begin by reading the CSV file into a Pandas DataFrame from a GitHub
location where we have stored a copy of the dataset:
[In]: import pandas as pd
[In]: column_names = ['user_id', 'product_id', 'rating', 'timestamp']
[In]: pandas_df = pd.read_csv(
          'https://raw.githubusercontent.com/abdelaziztestas/'
          'spark_book/main/amazon_electronics.csv', names=column_names)
We first import the Pandas library as pd and then define a list called
column_names, which contains the names of the columns for the DataFrame.
Next, we read the CSV
file using the read_csv() method and assign the resulting DataFrame to the
variable pandas_df.
We should now be able to explore the dataset by first looking at the shape of
the DataFrame:
[In]: print(pandas_df.shape)
[Out]: (1048576, 4)
The DataFrame has 1,048,576 rows and 4 columns. We can print the names of these columns by using the Pandas columns attribute:
[In]: print(pandas_df.columns)
[Out]: Index(['user_id', 'product_id', 'rating', 'timestamp'], dtype='object')
The names of the four columns are user_id, product_id, rating, and
timestamp. The following is a description of each column:
• timestamp: This is the time the user rated the product. It is irrelevant for
this project.
We can display the first five rows of each of these columns by using the
Pandas head() method:
[In]: print(pandas_df.head())
[Out]:
The rating column in the sample data ranges from 1 to 5 stars, where 1 star is
Very Poor, suggesting that the customer had a highly negative experience
with the product, and 5 is Excellent, indicating that the customer had a
highly positive experience with the product. The rating of 3 means Average,
suggesting that the product met the customer’s basic expectations but may
not have exceeded them.
• 2 stars: Poor. This indicates dissatisfaction with the product but may not be
as severe as a 1-star rating.
We can confirm that the dataset has ratings that range from 1 to 5 as follows:
[In]: import numpy as np
[In]: sorted_unique_ratings = np.sort(pandas_df['rating'].unique())
[In]: print(sorted_unique_ratings)
[Out]: [1 2 3 4 5]
In this code, we import the NumPy library as np and then create a variable
called sorted_unique_ratings to store the unique values of the rating column
from a Pandas DataFrame (pandas_df). We use NumPy’s sort() function to
sort these unique values in ascending order. Finally, we print the sorted and
unique ratings.
We can check the data types of these columns using the Pandas dtypes
attribute.
[In]: print(pandas_df.dtypes)
[Out]:
user_id object
product_id object
rating int64
timestamp int64
dtype: object
The output indicates that the rating column is numeric while the user_id and
product_id are categorical. For Surprise’s Singular Value Decomposition
(SVD) algorithm, there is no need for manual label encoding or one-hot
encoding of these categorical variables as Surprise is designed to handle
categorical variables like user IDs and item IDs directly. Surprise internally
manages the mapping of these categorical values to numerical indices during
the model training and prediction process.
Let’s perform a count of occurrences for each unique value in the user_id
column of the pandas_df DataFrame.
[In]: pandas_result_desc =
pandas_df.groupby('user_id').size().reset_index(name='count').sort_
values(by='count', ascending=False).head(5)
[In]: print(pandas_result_desc)
[Out]:
user_id count
[In]: pandas_result_asc =
pandas_df.groupby('user_id').size().reset_index(name='count').sort_
values(by='count', ascending=True).head(5)
[In]: print(pandas_result_asc)
[Out]:
user_id count
0 A00037441I8XOQJSUWCAG 1
499070 A3EEUT0F9898GS 1
499071 A3EEUZUBZQ4N4D 1
499072 A3EEVD827ZC4JY 1
499073 A3EEW1G63825UL 1
The output indicates that the top user has rated 412 Amazon electronic
products and each user has rated at least 1 product.
[In]: pandas_result_desc = pandas_df.groupby('product_id').size().reset_index(name='count').sort_values(by='count', ascending=False).head(5)
[In]: print(pandas_result_desc)
[Out]:
product_id count
[In]: pandas_result_asc = pandas_df.groupby('product_id').size().reset_index(name='count').sort_values(by='count', ascending=True).head(5)
[In]: print(pandas_result_asc)
[Out]:
product_id count
46105 B000BMQOM0 1
17738 B00009R8AJ 1
43278 B000AAJ27W 1
17736 B00009R8AH 1
43279 B000AAJ2CW 1
We can see from the output that the top product has been rated 9,487 times
and each product has been rated at least once.
[In]: spark_df = spark.createDataFrame(pandas_df)
[In]: print((spark_df.count(), len(spark_df.columns)))
[Out]: (1048576, 4)
In the next step, we display the column names using the PySpark columns
attribute:
[In]: print(spark_df.columns)
[In]: spark_df.show(5)
[Out]:
+--------------+----------+------+----------+
| user_id|product_id|rating| timestamp|
+--------------+----------+------+----------+
+--------------+----------+------+----------+
We can check that there are indeed five unique ratings ranging from 1 to 5
stars as follows:
[In]: from pyspark.sql.functions import col
[In]: unique_sorted_ratings = spark_df.select("rating").distinct().orderBy(col("rating"))
[In]: unique_sorted_ratings.show()
[Out]:
+------+
|rating|
+------+
| 1|
| 2|
| 3|
| 4|
| 5|
+------+
To get these unique ratings, we first import the col function, which is used
for column operations. Next, we select the rating column from the
DataFrame, ensuring that we obtain unique values using the distinct()
method. Afterward, we use the orderBy() function with the col(“rating”)
expression to sort the unique rating values in ascending order. Finally, we
display the result using the show() method.
[In]: spark_df.printSchema()
[Out]:
root
 |-- user_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- rating: long (nullable = true)
 |-- timestamp: long (nullable = true)
This output confirms the data types indicated by Pandas: two numerical
columns (rating and timestamp) and two categorical columns (user_id and
product_id). The PySpark model requires the user_id and product_id to be in
numerical format, so we will need to make this conversion before training
the model.
[In]: spark_df.groupBy("user_id").count().orderBy("count",
ascending=False).show(5)
[Out]:
+--------------+-----+
| user_id|count|
+--------------+-----+
| A5JLAU2ARJ0BO| 412|
|A231WM2Z2JL0U3| 249|
|A25HBO5V8S8SEA| 164|
| A6FIAB28IS79| 146|
| AT6CZDCP4TRGA| 128|
+--------------+-----+
[In]: spark_df.groupBy("user_id").count().orderBy("count",
ascending=True).show(5)
[Out]:
+--------------+-----+
| user_id|count|
+--------------+-----+
| AFS5ZGZ2M3ZGV| 1|
|A19TDRIKW64Z6I| 1|
|A1K775TKUNZL43| 1|
|A2VS1WG9VKINXI| 1|
| A76QA8ID3NTCC| 1|
+--------------+-----+
These results are similar to the output produced by Pandas; that is, the most
active user had 412 occurrence counts while the least active user had 1.
[In]: spark_df.groupBy("product_id").count().orderBy("count",
ascending=False).show(5)
[Out]:
+----------+-----+
|product_id|count|
+----------+-----+
|B0002L5R78| 9487|
|B0001FTVEK| 5345|
|B000I68BD4| 4903|
|B000BQ7GW8| 4275|
|B00007E7JU| 3523|
+----------+-----+
[In]: spark_df.groupBy("product_id").count().orderBy("count",
ascending=False).show(5)
[Out]:
+----------+-----+
|product_id|count|
+----------+-----+
|B00004SC3R| 1|
|B00005KHSJ| 1|
|B00004YK3Q| 1|
|B00000J025| 1|
|B000050GDM| 1|
+----------+-----+
The results confirm the Pandas output: the top electronic product has been
rated 9,487, and each product has been rated at least once.
This code sets the stage for building and evaluating a recommendation
system using the Surprise library. The typical workflow is to proceed by
loading the dataset, defining the data reader, splitting the data into training
and testing sets using train_test_split, and then training and evaluating the
SVD recommendation model using the imported functions and classes.
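A sketch of the imports that this workflow implies (importing the accuracy module is an assumption, used below for the RMSE calculation):

import pandas as pd
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split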
In the code provided previously, we first import the Pandas library for data
loading.
We then import Dataset, Reader, and SVD from the surprise library.
• Reader: This defines the format of the input data to be used with
Surprise.
[In]: column_names = ['user_id', 'product_id', 'rating', 'timestamp']
[In]: pandas_df = pd.read_csv(
          'https://raw.githubusercontent.com/abdelaziztestas/'
          'spark_book/main/amazon_electronics.csv', names=column_names)
This code involves creating a DataFrame using the Pandas library. The first
line creates a list called column_names that contains the names of the
columns in the DataFrame. The columns are user_id, product_id, rating, and
timestamp. This list will be used to assign column names to the DataFrame.
The second line reads the CSV file from a URL that points to the location of
the CSV file on GitHub and creates a Pandas DataFrame named pandas_df.
The read_csv function is used to read the CSV file. The argument
names=column_names specifies that the names in the column_names list
should be used as the column names for the Pandas DataFrame.
Step 3: Define the rating scale and load data into a Surprise Dataset
[In]: reader = Reader(rating_scale=(1, 5))
[In]: data = Dataset.load_from_df(pandas_df[['user_id', 'product_id',
          'rating']], reader)
The purpose of this step is to create a Dataset object, which will later be used
to train and evaluate the recommender system model in Surprise. We can
break this into a two-step process:
• Define the reader: Surprise expects the input data in a specific format, and using the Reader class helps define that format, including the rating scale of 1 to 5.
• Load the DataFrame: Dataset.load_from_df() takes the user, item, and rating columns together with the reader and returns the Dataset object.
This line of code uses the train_test_split function from the Surprise library
to split the data into training and testing sets. It uses data (the Dataset object)
that contains the data to be split as an argument. The test_size=0.2 specifies
the proportion of the data that should be allocated for testing. It is set to 0.2,
which means 20% of the data will be used for testing, and the remaining
80% will be used for training. The resulting train and test datasets are
trainset and testset, respectively. Each contains three elements: user_id,
product_id, and rating.
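A minimal sketch of the split described above:

trainset, testset = train_test_split(data, test_size=0.2)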
This line creates an object of the SVD class from the Surprise library and
assigns it to the variable model. SVD is a matrix factorization technique
commonly used in recommender systems.
[In]: model.fit(trainset)
This line of code is used to train the SVD model on the training set using the
fit() method of the model object, which is an instance of the SVD class in the
Surprise library.
The trainset contains user-item ratings required by the SVD model to learn
and make predictions.
During the training process, the SVD model analyzes the user-item
interactions in the training set and learns the underlying patterns and
relationships. It uses the matrix factorization technique to decompose the
rating matrix and estimate latent factors associated with users and items. The
model adjusts its internal parameters to minimize the difference between
predicted ratings and actual ratings in the training set.
This code is used to generate predictions using the trained SVD model and
calculate the root mean squared error (RMSE) as a measure of prediction
accuracy.
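A sketch of the prediction and RMSE calculation (the variable names follow the surrounding text; accuracy.rmse prints and returns the score):

test_predictions = model.test(testset)
rmse = accuracy.rmse(test_predictions)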
We can print the predictions for the first five test cases and compare them with the actual ratings as follows:
[In]: for prediction in test_predictions[:5]:
          rounded_prediction = round(prediction.est, 1)
          print(f"user: {prediction.uid}, item: {prediction.iid}, "
                f"actual: {prediction.r_ui}, predicted: {rounded_prediction}")
[Out]:
The code prints predictions for the first five test cases generated by the SVD
recommendation system. It begins with a for loop iterating over the first five
predictions in the test_predictions collection. For each prediction, it
calculates the predicted rating, rounding it to one decimal place. The script
then prints key information: the user ID, item ID, actual rating (ground
truth), and the rounded predicted rating. The code uses formatted strings (f-
strings) for clean and readable output. It includes the following details:
the predicted rating produced by the model, and the user-item pair (together with its actual rating) that the prediction refers to.
Even though this is just a sample, it gives us an idea of how the model is
performing.
[In]: predicted_rating = model.predict(new_user_id, new_product_id).est
[In]: print(f'Predicted rating for user {new_user_id} and product '
            f'{new_product_id}: {predicted_rating:.2f}')
[Out]:
In this example, the model predicts a rating of 3.79 for the user AK6PVIKDN83MW and the selected product.
The same Amazon electronic products dataset we used with Surprise will be used to train and test the PySpark model.
This code is used to import necessary modules from the PySpark library to
work with the recommendation system. The StringIndexer class is used to
convert categorical variables into numerical indices. This is necessary as
PySpark, unlike Surprise, needs the data in an indexed format. The
StringIndexer will be used to index both user and item IDs. The ALS is the
Alternating Least Squares class used to build the collaborative filtering
algorithm based on matrix factorization. The RegressionEvaluator class is
used to evaluate the performance of the model by calculating the RMSE
(root mean squared error). The SparkSession class is the entry point for
working with structured data in PySpark. It provides the functionality to
create DataFrames and perform various data processing operations.
outputCol="user_index")
transform(indexed_data)
• The third line applies the fit() and transform() methods of the
346
indexed, and the resulting indexed values are stored in the user_
index column.
obtained in the previous step. Similar to the previous line, it fits the
transformation on the DataFrame and applies the transformation
indexed, and the resulting indexed values are stored in the product_
index column.
This line of code is used to split the indexed_data DataFrame into training
and testing datasets. The indexed_data DataFrame represents the data where
categorical columns user_id and product_id have been converted into
numerical indices using StringIndexer. To create separate sets for training
and testing, the randomSplit() function is used. This splits the indexed_data
DataFrame into two datasets: train and test. The first argument [0.8, 0.2]
specifies the proportions for the split. In this case, 80%
of the data is allocated to the training set (train), while the remaining 20% is
allocated to the testing set (test). The second argument seed=42 sets a
specific random seed to ensure reproducibility, meaning the split will be the
same every time the code is executed with the same seed value.
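A minimal sketch of the split described above:

train, test = indexed_data.randomSplit([0.8, 0.2], seed=42)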
The coldStartStrategy parameter determines how predictions are handled for users or items that were not present in the training data. The drop strategy is used here, which means that any rows with missing user or item factors are dropped from the predictions rather than returned as NaN.
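A sketch of the ALS definition and training this passage describes (the column names follow the indexing step above):

from pyspark.ml.recommendation import ALS

als = ALS(userCol="user_index", itemCol="product_index", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(train)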
This line of code is used to train the ALS (Alternating Least Squares) model
on the training dataset (train). It calls the fit() method on the als object,
which is an instance of the ALS algorithm in PySpark. The fit() method
trains the ALS model using the training dataset.
During the training process, the ALS algorithm will optimize the model’s
parameters to minimize the difference between the predicted ratings and the
actual ratings in the training dataset. The algorithm uses an iterative
approach, alternating between updating the user factors and item factors to
find the optimal values that capture the underlying patterns in the data. Once
the training process is completed, the fit() method returns a trained ALS
model, which is assigned to the variable model. This trained model can then
be used to make predictions on new, unseen data or to generate
recommendations based on user-item interactions.
This code is used to generate predictions using the trained ALS model and
evaluate its performance using the RMSE (root mean squared error) metric.
• The first line applies the trained ALS model (model) to the test
the model to the test data and generates predictions for the user-item
set to rating, indicating that the actual ratings are stored in the
rating column.
RMSE between the predicted ratings and the actual ratings in the
variable rmse_score.
• The last line prints the RMSE score. It provides an indication of how well
the ALS model performed in predicting the ratings on the test
ratings.
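A sketch of the three lines described in these bullets:

from pyspark.ml.evaluation import RegressionEvaluator

predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse_score = evaluator.evaluate(predictions)
print(f"RMSE: {rmse_score}")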
"prediction").show(5)
[Out]:
+--------------+----------+------+----------+
| user_id|product_id|rating|prediction|
+--------------+----------+------+----------+
|A2AY4YUOX2N1BQ|B000050ZS3| 4| 4.310914|
|A2AY4YUOX2N1BQ|B0000665V7| 5| 4.56616|
|A2AY4YUOX2N1BQ|B00008R9ML| 4| 5.730056|
|A2AY4YUOX2N1BQ|B00009R76N| 5| 4.4490294|
|A2AY4YUOX2N1BQ|B00009R7BD| 5| 6.400287|
+--------------+----------+------+----------+
The output indicates discrepancies between the actual and predicted ratings,
although some are relatively small after rounding. For example, user A2AY4YUOX2N1BQ rated product B000050ZS3 a 4, while the model predicted approximately 4.3.
As for the RMSE, the reported value is approximately 1.96, which means
that, on average, the predicted ratings by the ALS model deviate from the
actual ratings by approximately 1.96 units. This value provides an indication
of the model’s accuracy in predicting user-item ratings. The closer the
RMSE is to zero, the better the model’s predictions align with the true
ratings.
Let’s now consolidate all the code snippets from the previous steps into a
single code block. This way, the reader can execute the code as a single
block in both Surprise and PySpark.
Surprise
[In]: pandas_df = pd.read_csv(
          'https://raw.githubusercontent.com/abdelaziztestas/'
          'spark_book/main/amazon_electronics.csv', names=column_names)
[In]: data = Dataset.load_from_df(pandas_df[['user_id', 'product_id',
          'rating']], reader)
[In]: model.fit(trainset)
rounded_prediction = round(prediction.est, 1)
product_id).est
{new_product_id}: {predicted_rating:.2f}')
PySpark
outputCol="user_index")
transform(indexed_data)
"prediction").show(5)
Summary
In the next chapter, we will delve into yet another area of supervised
learning: natural language processing (NLP), exploring how to analyze and
extract valuable insights from text data.
CHAPTER 14
Natural Language Processing with Pandas, Scikit-Learn, and PySpark
The project of this chapter is to examine the key steps involved in processing
text data using an open source dataset known as the 20 Newsgroups. This is
a collection of newsgroup documents partitioned into 20 different topics. We
build, train, and evaluate a Multinomial Naive Bayes algorithm and use it to
predict the topic categories in this dataset. Even though any other supervised
learning classification model covered in the previous chapters can be used,
Naive Bayes is often the model of choice for NLP tasks such as topic
modeling (the task at hand).
Before training the model, we need to ensure that the text data that feeds into
it is in the correct format. This requires cleaning, tokenization, and
vectorization. Cleaning in the context of NLP is the act of preprocessing text
data by removing irrelevant or noisy information that does not add much to
the meaning, such as punctuation marks and stop words. Tokenization
involves breaking down the text into smaller units called tokens, which can
be words, phrases, or even characters, which helps to organize the text and
enables further analysis. Vectorization encompasses converting the text data
into numerical representations that the machine learning algorithm can
understand. We explore two methods commonly used for this purpose: Bag
of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-
IDF).
The Dataset
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast',
'talk.politics.misc', 'talk.religion.misc']
Step 3: Create a Pandas DataFrame from the Scikit-Learn data and target
arrays
[In]: newsgroups_df = pd.DataFrame({'text': newsgroups_all.data, 'label':
newsgroups_all.target})
The text column is populated from the data attribute, and the label column from the target attribute, of the newsgroups_all object.
Now that we have successfully created a Pandas DataFrame that holds the 20 Newsgroups data, let's check its dimensions using the shape attribute:
[In]: newsgroups_df.shape
[Out]: (18846, 2)
This output indicates that the DataFrame has 18,846 rows and 2 columns
(text and label).
Next, we show the top five rows of the DataFrame using the Pandas head()
method:
[In]: print(newsgroups_df.head())
[Out]:
text label
[In]: newsgroups_df.label.nunique()
[Out]: 20
The text column is truncated and doesn’t tell us much about its content. To
print the entire first line, we can run the following code, which extracts the
value of the text column for the first row (index 0) in the Pandas DataFrame:
[In]: print(newsgroups_df.iloc[0]['text'])
[Out]:
I am sure some bashers of Pens fans are pretty confused about the lack of
any kind of posts about the recent Pens massacre of the Devils. Actually, I
am bit puzzled too and a bit relieved. However, I am going to put an end to
non-PIttsburghers' relief with a bit of praise for the Pens. Man, they are
killing those Devils worse than I thought. Jagr just showed you why he is
much better than his regular season stats. He is also a lot fo fun to watch in
the playoffs. Bowman should let JAgr have a lot of fun in the next couple of
games since the Pens are going to beat the pulp out of Jersey anyway. I was
very disappointed not to see the Islanders lose the final regular season game.
PENS RULE!!!
[In]: shape = (spark_df.count(), len(spark_df.columns))
[In]: print(shape)
[Out]: (18846, 2)
This output confirms Pandas output (18,846 rows and 2 columns) using the
shape attribute. Since PySpark doesn’t have a built-in shape attribute, we
combined the count() and len() methods to calculate the number of rows and
columns, respectively.
[In]: spark_df.show(5)
[Out]:
+--------------------+-----+
| text|label|
+--------------------+-----+
|From: mblawson@mi...| 3|
+--------------------+-----+
The output is truncated. We can specify a larger value for the truncate
parameter in the show() method to increase the truncation limit. For
example, the following code will set truncation to the maximum length,
effectively displaying the full content of each column:
[Out]:
I am sure some bashers of Pens fans are pretty confused about the lack\nof
any kind of posts about the recent Pens massacre of the Devils. Actually,\ nI
am bit puzzled too and a bit relieved. However, I am going to put an end\nto
non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare
killing those Devils worse than I thought. Jagr just showed you why\nhe is
much better than his regular season stats. He is also a lot\nfo fun to watch in
the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple
of games since the Pens are going to beat the pulp out of Jersey anyway. I
was very disappointed not to see the Islanders lose the final\nregular season
game. PENS RULE!!!\n\n|10
[In]: text_value = spark_df.select("text").first()[0]
[In]: print(text_value)
[Out]:
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu
I am sure some bashers of Pens fans are pretty confused about the lack of
any kind of posts about the recent Pens massacre of the Devils. Actually, I
am bit puzzled too and a bit relieved. However, I am going to put an end to
non-PIttsburghers' relief with a bit of praise for the Pens. Man, they are
killing those Devils worse than I thought. Jagr just showed you why he is
much better than his regular season stats. He is also a lot fo fun to watch in
the playoffs. Bowman should let JAgr have a lot of fun in the next couple of
games since the Pens are going to beat the pulp out of Jersey anyway. I was
very disappointed not to see the Islanders lose the final regular season game.
PENS RULE!!!
The first line of code selects the text column from spark_df and retrieves the
first row as a Row object. The select() method is used to select only the text
column from the DataFrame, and first() retrieves the first row as a Row
object. The second line prints the value stored in text_value, which
represents the text content from the text column of the first row in the
DataFrame.
[In]: spark_df.select("label").distinct().count()
[Out]: 20
This confirms that the PySpark DataFrame also contains 20 distinct labels. Before feeding the text into the model, we need to clean, tokenize, and vectorize it to prepare it for analysis.
There are two main approaches to the last step (vectorization): Bag of Words
(BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).
Let’s use a few sentences about visiting the Eiffel Tower in Paris as an
example for creating a BoW representation. In this example, we’ll tokenize
the text, remove punctuation and common stop words, and create a simple
BoW vector:
Each sentence is tokenized, punctuation and common stop words are removed, and the remaining tokens (terms such as "summer" and "attractions") form a shared vocabulary of 11 terms. Counting how often each vocabulary term appears in each sentence yields one BoW vector per sentence:
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1]
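To illustrate the mechanics programmatically, the following sketch builds a BoW matrix with Scikit-Learn's CountVectorizer; the three sentences are stand-ins rather than the exact sentences used in the book:

from sklearn.feature_extraction.text import CountVectorizer

# Stand-in sentences about visiting the Eiffel Tower
sentences = [
    "We visited the Eiffel Tower in Paris last summer",
    "The Eiffel Tower is a famous landmark in Paris",
    "Paris has many attractions for visitors",
]

# Tokenize, drop English stop words, and count term occurrences
vectorizer = CountVectorizer(stop_words="english")
bow_matrix = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(bow_matrix.toarray())                # one row of counts per sentence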
The formulas for Term Frequency (TF) and Inverse Document Frequency (IDF) are as follows:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
IDF(t) = log(N / (number of documents that contain term t))
where N is the total number of documents in the corpus. The TF-IDF weight of a term in a document is the product TF(t, d) × IDF(t).
Despite its strengths, TF-IDF has certain limitations. One drawback is that it
treats each term independently, failing to capture semantic relationships
between words or phrases in a document. Additionally, TF-IDF relies on
word frequency, which may not always accurately reflect the importance of
terms in more complex language structures.
We can now write some Python code to apply the preprocessing steps to the
20
We also need to download punkt and stopwords resources from the NLTK
library:
[In]: nltk.download('punkt')
[In]: nltk.download('stopwords')
The first line of code imports NLTK. The second line imports the fetch_20newsgroups function from Scikit-Learn. The next line of code is used to fetch the 20 Newsgroups dataset and store it in
the variable newsgroups_data using the fetch_20newsgroups function. The
subset=‘all’
parameter specifies which subset of the dataset (train, test, or all) to fetch. In
this case,
‘all’ indicates that all the data from the 20 Newsgroups dataset will be
fetched.
[In]: import re
[In]: text = "Check out this website: https://www.example.com. It has great content."
[In]: clean_text = re.sub(r'http[s]?://\S+', '', text)
[In]: print(clean_text)
In the preceding example, we want to remove any URLs from the given text
because it doesn’t add to the quality of the input for the algorithm (on the
contrary, unnecessary patterns can create noise in the data and increase
processing time). We define a regular expression pattern http[s]?://\S+ that
matches URLs starting with either “http://” or
[In]: document_index = 0
[In]: document = newsgroups_data.data[document_index]
These lines select the first document (index 0) from the 20 Newsgroups dataset. We have chosen for the sake of illustration to work with just one document.
[In]: def tokenize(text):
          tokens = nltk.word_tokenize(text)
          stop_words = set(nltk.corpus.stopwords.words('english'))
          punctuations = set(string.punctuation)
          filtered_tokens = [token for token in tokens if token.lower() not in stop_words and token not in punctuations]
          return filtered_tokens
The function takes a text parameter as input. This function is responsible for
tokenizing the given text into individual words and filtering out stop words
and punctuation.
characters.
The tokens are then filtered using a list comprehension. Each token is
checked if it is not in the set of stop words (after converting it to lowercase)
and not in the set of punctuation characters. The filtered tokens are stored in
the filtered_tokens list, which is returned by the tokenize function.
[In]: tokens = tokenize(document)
This line calls the tokenize function, passing the document as the input text.
The document variable contains the text of the document. The resulting
tokens are stored in the tokens variable.
[In]: print("Tokens:")
[In]: print(tokens)
Let’s take a look at the cleaned and tokenized output before moving on to the
vectorization process:
[Out]:
Tokens:
I am sure some bashers of Pens fans are pretty confused about the lack of
any kind of posts about the recent Pens massacre of the Devils. Actually, I
am bit puzzled too and a bit relieved. However, I am going to put an end to
non-PIttsburghers’ relief with a bit of praise for the Pens. Man, they are
killing those Devils worse than I thought. Jagr just showed you why he is
much better than his regular season stats. He is also a lot fo fun to watch in
the playoffs. Bowman should let JAgr have a lot of fun in the next couple of
games since the Pens are going to beat the pulp out of Jersey anyway. I was
very disappointed not to see the Islanders lose the final regular season game.
PENS RULE!!!
Upon comparing the original text and the generated tokens, we can see that
the tokenization process has split the text into individual words and removed
both punctuation marks and stop words.
• The fit_transform() method of the TfidfVectorizer learns the vocabulary from the document and computes the IDF values based on the term frequencies. Since we are vectorizing a single document, the vocabulary and IDF values will be learned from that document.
• The result is a sparse matrix with one row (the document) and one column per term in the vocabulary. The matrix contains the TF-IDF values for each term in the document.
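A minimal sketch of this vectorization step, assuming the filtered tokens are joined back into a single string before being passed to the vectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vectorizer on the single cleaned and tokenized document
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform([' '.join(tokens)])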
[In]: print(tfidf_matrix.toarray())
[Out]:
TF-IDF Matrix:
This TF-IDF matrix represents the TF-IDF values for the terms in the
document.
Each value in the matrix represents the TF-IDF score for the corresponding
term in the document. TF-IDF combines two measures: Term Frequency
(TF) and Inverse Document Frequency (IDF). TF measures the occurrence
of a term in the document, while IDF measures the rarity of the term across
all documents in the dataset.
The two libraries are similar in their overall structure as they both perform
text classification using Naive Bayes and handle preprocessing steps like
data cleaning, tokenization, and vectorization. One key difference is that
Scikit-Learn directly uses TfidfVectorizer to perform tokenization, stop word
removal, and TF-IDF feature extraction in a single step, while PySpark
utilizes separate transformers like Tokenizer, StopWordsRemover,
HashingTF, and IDF to achieve similar functionality.
In Scikit-Learn, the workflow is to define a tokenizer, fetch the dataset for the selected categories, split it into training and test sets, convert the text into TF-IDF vectors, train a Multinomial Naive Bayes classifier, and evaluate its accuracy.
In PySpark, the workflow is to create a SparkSession, fetch the dataset and convert it into a PySpark DataFrame, tokenize the text and remove punctuation and stop words, build TF-IDF features with HashingTF and IDF, assemble the features, split the data, train a Naive Bayes model, and evaluate its accuracy.
This code sets up the necessary imports to work with the NLTK (Natural
Language Toolkit) and Scikit-Learn packages for text classification using the
20 Newsgroups dataset. We first import nltk—a widely used library for
working with human language data, such as text processing and
tokenization. We then import the fetch_20newsgroups function to download
and load the 20 Newsgroups dataset. Next, we import the TfidfVectorizer
class to convert text documents into numerical feature vectors using the TF-
IDF (Term Frequency-Inverse Document Frequency) representation. In the
next step, we import the MultinomialNB class, which is an algorithm
commonly used for text classification based on the Naive Bayes principle.
Moving on, we import the accuracy_score function, which is used to evaluate the accuracy of the model's predictions.
[In]: def tokenize(text):
          tokens = nltk.word_tokenize(text)
          tokens = [token for token in tokens if token not in string.punctuation]
          return tokens
religion.misc']
shuffle=True, random_state=42)
In this code, the categories list contains the names of different topics or
categories that are part of the 20 Newsgroups dataset. Each category
represents a specific topic. The fetch_20newsgroups function is used to fetch
the 20 Newsgroups dataset. It takes the categories list as an argument to
specify which categories to include in the dataset. The subset=‘all’
parameter indicates that all documents from the specified categories should
be included. The remove=(‘headers’, ‘footers’, ‘quotes’) parameter specifies
to remove certain parts of the documents (headers, footers, and quotes) that
are not relevant for the task at hand. The shuffle=True parameter shuffles the
dataset randomly, and random_
tokenizer=tokenize)
The fit_transform method is used to fit the vectorizer on the training data
and transform it, while the transform method is used to apply the same
transformation to the test data. This allows the text data to be ready for use
in our machine learning model.
Only two lines of code are required to train the algorithm. In the first line, an
instance of the MultinomialNB class is created and assigned to the variable
clf. The MultinomialNB class implements the Multinomial Naive Bayes
algorithm. Naive Bayes is a probabilistic machine learning algorithm
commonly used for text classification tasks.
The fit() method trains the Multinomial Naive Bayes classifier by estimating
the probabilities of each class label based on the input features and the
corresponding target labels. After the fit() method is executed, the clf object
is trained and ready to make predictions on new, unseen data.
In this line of code, the trained classifier clf is used to predict the class labels
for the test data X_test_vectors. The predict() method of the classifier is
called, and it takes the test data as its argument. The X_test_vectors variable
represents the TF-IDF
vectors of the test data obtained from the TfidfVectorizer. Each row of
X_test_vectors corresponds to a document, and each column represents a
TF-IDF feature.
The predict() method applies the learned model on the test data and assigns a
predicted class label to each instance. It uses the trained classifier’s learned
parameters to calculate the posterior probability of each class given the input
features. Based on these probabilities, it predicts the most probable class
label for each instance. The resulting predicted class labels are stored in the
y_pred variable, which now holds the predicted labels for the test instances
based on the trained classifier.
[In]: for i in range(5):
          actual_label = newsgroups_all.target_names[y_test[i]]
          predicted_label = newsgroups_all.target_names[y_pred[i]]
          print("Actual:", actual_label)
          print("Predicted:", predicted_label)
          print("---")
[Out]:
Actual : rec.sport.baseball
Predicted: rec.sport.baseball
---
Actual : sci.electronics
Predicted: comp.os.ms-windows.misc
---
Actual : sci.space
Predicted: sci.space
---
Actual : talk.politics.misc
Predicted: talk.politics.misc
---
Actual : alt.atheism
Predicted: sci.space
---
Accuracy: 0.70
• The first line inside the loop retrieves the actual label of the i-th test instance. The y_test array contains the true labels of the test instances. The y_test[i] index is used to access the label index for the i-th test instance, and newsgroups_all.target_names is a list that maps label indices to their corresponding label names. By indexing newsgroups_all.target_names with y_test[i], we obtain the label name, which is stored in actual_label.
• The second line retrieves the predicted label of the i-th test instance. The y_pred array contains the predicted labels obtained from the classifier. Similar to the previous step, the y_pred[i] index is used to access the predicted label index for the i-th test instance, and newsgroups_all.target_names is used to map it to a label name, stored in predicted_label.
• The print statements display the actual and predicted labels, separated by "---", for the different instances.
Comparing the predicted labels with the true labels for the top five instances
in the preceding output, we can see that the model has made two mistakes.
The sci.electronics category is predicted as comp.os.ms-windows.misc, and
the alt.atheism category is predicted as sci.space.
[Out]: 0.70
In the first line of code, the accuracy_score function is used to calculate the
accuracy of the classifier’s predictions. The function takes two arguments:
(1) y_test, which represents the true labels or target values of the test
instances, and (2) y_pred, which represents the predicted labels obtained
from the classifier for the corresponding test instances. The function
compares the predicted labels (y_pred) with the true labels (y_test) and
computes the accuracy of the predictions. It calculates the fraction of
correctly predicted labels over the total number of instances. The resulting
accuracy value is assigned to the variable accuracy.
The second line prints the accuracy value calculated in the previous step.
The accuracy is displayed as a decimal value between 0 and 1, where 1
represents a perfect classification accuracy. The resulting accuracy value is
approximately 0.70, indicating that the Naive Bayes classifier’s predictions
match the true labels for about 70% of the test instances.
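For reference, a sketch of this evaluation step (the exact print formatting is an assumption):

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)   # fraction of correctly predicted labels
print(f"Accuracy: {accuracy:.2f}")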
[In]: from pyspark.ml.feature import RegexTokenizer, HashingTF, IDF, StopWordsRemover, VectorAssembler
Next, we import the NaiveBayes class for predicting class labels based on
input features. In the next step, we import the
MulticlassClassificationEvaluator class, which provides evaluation metrics
for the multiclass classification model. We also import the Pandas library for
data manipulation as well as the rand function (this will be used to generate a
random sample of actual and predicted class labels as part of the evaluation
step). Finally, we import the fetch_20newsgroups function to fetch the 20
Newsgroups dataset.
[In]: spark = SparkSession.builder.appName("TextClassification").getOrCreate()
In this line of code, a SparkSession object named spark is created using the
builder() method, which allows configuration and customization of the
SparkSession. The appName() method is used to set the name of the Spark
application as “TextClassification”. If a SparkSession with the specified
name already exists, it will be retrieved; otherwise, a new SparkSession will
be created. The SparkSession serves as the entry point for interacting with
Spark and provides functionality for working with distributed data
structures, executing operations, and accessing various Spark functionalities.
politics.misc', 'talk.religion.misc']
This is the same as in step 3 of the Scikit-Learn code. Here, the categories
list contains the names of different topics or categories that are part of the 20
Newsgroups dataset. Each category represents a specific topic. The
fetch_20newsgroups function is used to fetch the 20 Newsgroups dataset. It
takes the categories list as an argument to specify which categories to
include in the dataset. The subset=‘all’ parameter indicates that all
documents from the specified categories should be included. The remove=
(‘headers’, ‘footers’, ‘quotes’) parameter specifies to remove certain parts of
the documents (headers, footers, and quotes) that are not relevant for the task
at hand. The shuffle=True parameter shuffles the dataset randomly, and
random_state=42 sets a seed for reproducibility.
Step 4: Create a Pandas DataFrame from the Scikit-Learn data and target
arrays
The text column is populated with the data from newsgroups_all.data, while the label column is populated with the data from newsgroups_all.target.
Step 6: Tokenize the text and remove punctuation from the text column
[In]: tokenizer = RegexTokenizer(inputCol='text', outputCol='tokens', pattern='\\W')
[In]: tokenized_df = tokenizer.transform(pyspark_df)
The second line applies the tokenizer to the pyspark_df DataFrame using the
transform() method. It takes the input DataFrame, pyspark_df, and performs
the tokenization process using the tokenizer object created earlier. The
resulting tokenized DataFrame is assigned to a new DataFrame called
tokenized_df, which contains the original columns from pyspark_df along
with a new tokens column that holds the tokenized version of the text.
We can take a look at the top five rows of the tokenized_df DataFrame:
[In]: tokenized_df.show(5)
[Out]:
+--------------------+-----+--------------------+
| text|label| tokens|
+--------------------+-----+--------------------+
+--------------------+-----+--------------------+
[In]: stopwords_remover = StopWordsRemover(inputCol='tokens', outputCol='filtered_tokens')
[In]: filtered_df = stopwords_remover.transform(tokenized_df)
The second line applies the stop words remover to the tokenized_df
DataFrame using the transform() method. It takes the input DataFrame,
tokenized_df, and performs the stop words removal process using the
stopwords_remover object created earlier. The resulting DataFrame named
filtered_df contains the original columns from tokenized_df along with a
new filtered_tokens column that holds the tokens after stop words removal.
Let’s take a look at the top five rows of the filtered_df DataFrame:
[In]: filtered_df.show(5)
[Out]:
+--------------------+-----+--------------------+--------------------+
+--------------------+-----+--------------------+--------------------+
+--------------------+-----+--------------------+--------------------+
only showing top 5 rows
[In]: hashing_tf = HashingTF(inputCol='filtered_tokens', outputCol='rawFeatures')
[In]: tf_data = hashing_tf.transform(filtered_df)
In this step, we transform the tokenized and filtered DataFrame into a TF-
IDF (Term Frequency-Inverse Document Frequency) representation. The
first line of code creates a HashingTF object named hashing_tf that takes the
filtered_tokens column as input and generates a sparse vector representation
called rawFeatures as output.
In the third line, an IDF (Inverse Document Frequency) object named idf is
created, specifying the rawFeatures column as input and the features column
as output. The IDF
measures the importance of each term in a document collection.
In the fourth line, the IDF transformation is fitted to the tf_data DataFrame
using the fit() method, creating an IDF model named idf_model. This step
calculates the IDF
Finally, the IDF model is applied to the tf_data DataFrame using the
transform() method. It generates a new DataFrame named tfidf_data that
includes the original columns from tf_data along with a new column called
features, which contains the TFIDF vector representation for each document.
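Putting the lines described above together, the TF-IDF transformation could be sketched as follows (variable names follow the text):

from pyspark.ml.feature import HashingTF, IDF

hashing_tf = HashingTF(inputCol='filtered_tokens', outputCol='rawFeatures')
tf_data = hashing_tf.transform(filtered_df)        # raw term frequencies

idf = IDF(inputCol='rawFeatures', outputCol='features')
idf_model = idf.fit(tf_data)                       # learn the IDF weights
tfidf_data = idf_model.transform(tf_data)          # TF-IDF vectors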
Let’s take a look at the vectorized columns, rawFeatures and features, from
the tfidf_
data DataFrame. Starting with the rawFeatures column, the following code
retrieves the first row of the DataFrame using the first() method, selects the
rawFeatures column, and then accesses the first element [0] in that column:
[In]: tfidf_data.select('rawFeatures').first()[0]
[Out]:
SparseVector(262144, {2710: 1.0, 8538: 1.0, 11919: 1.0, 12454: 1.0, 23071:
1.0, 23087: 2.0, 31739: 2.0, 45916: 1.0, 51178: 2.0, 54961: 1.0, 55270: 2.0,
57448: 1.0, 58394: 2.0, 76764: 1.0, 77751: 1.0, 79132: 1.0, 82237: 1.0,
84685: 1.0, 91431: 1.0, 92726: 1.0, 96760: 1.0, 102382: 2.0, 105901: 1.0,
107855: 1.0, 109753: 1.0, 109906: 1.0, 116581: 1.0, 127337: 1.0,
132975: 1.0, 134125: 1.0, 136496: 1.0, 137765: 3.0, 138895: 1.0, 142239:
1.0, 142343: 1.0, 143478: 5.0, 147136: 1.0, 153969: 1.0, 156917: 1.0,
158102: 1.0, 169300: 1.0, 173339: 1.0, 174802: 1.0, 181001: 1.0, 194361:
1.0, 194710: 1.0, 196689: 1.0, 196839: 1.0, 208344: 1.0, 216372: 1.0,
217435: 1.0, 223985: 1.0, 229604: 1.0, 233967: 1.0, 235375: 1.0, 245599:
2.0, 245973: 1.0, 256965: 1.0})
The values in the SparseVector represent the term frequencies (TF) of each
term in the document, before applying the IDF transformation. Each value in
the vector represents the TF of a specific term at the corresponding index in
the vector. The TF is a count of how many times a term appears in the
document. For example, if we have an entry {2710: 1.0, 8538: 1.0}, it means
that the term at index 2710 appears once in the document, and the term at
index 8538 also appears once.
The rawFeatures column provides the raw term frequencies, which can be
useful for certain tasks or analysis. However, TF alone does not consider the
importance of a term across the entire corpus. This is why the TF-IDF
transformation is applied to obtain the features column, which incorporates
both the term frequencies and inverse document frequencies to better
represent the importance of terms in the document within the context of the
entire corpus.
[In]: tfidf_data.select('features').first()[0]
[Out]:
76764: 2.0493, 77751: 2.8383, 79132: 3.1372, 82237: 4.4781, 84685: 3.4,
91431: 4.3803, 92726: 3.5055, 96760: 4.613, 102382: 5.1199, 105901:
5.5003, 107855: 7.2792, 109753: 3.306, 109906: 6.586, 116581: 4.6401,
127337:
5.4621, 132975: 2.7923, 134125: 2.5627, 136496: 6.4429, 137765:
8.6343, 138895: 3.2621, 142239: 2.6709, 142343: 3.0585, 143478: 27.3733,
147136: 1.7614, 153969: 4.2606, 156917: 2.972, 158102: 6.4429, 169300:
3.6416,
In the context of the features column in tfidf_data, each value represents the
TFIDF weight for a specific term at the corresponding index in the vector. A
higher TFIDF weight indicates that the term is more important or distinctive
in the document compared to the rest of the corpus. For example, let’s
consider the entry {2710: 6.1064, 8538: 2.1065}. This means that the term at
index 2710 has a TF-IDF weight of 6.1064, and the term at index 8538 has a
TF-IDF weight of 2.1065. These weights indicate the relative importance of
these terms within the document, with a higher weight suggesting greater
significance.
[In]: assembler = VectorAssembler(inputCols=['features'], outputCol='vectorized_features')
[In]: assembled_data = assembler.transform(tfidf_data)
Step 10: Split the dataset into train and test sets
[In]: train_data, test_data = assembled_data.randomSplit([0.8, 0.2], seed=42)
In this line of code, the dataset assembled_data is split into two separate sets, a training set and a test set, using the randomSplit function. This takes two arguments: the first argument is a list of proportions ([0.8, 0.2]) that determines the relative sizes of the resulting splits, and the second argument, seed=42, sets a random seed so that the split is reproducible.
The resulting splits are assigned to the variables train_data and test_data,
representing the training dataset and the test dataset, respectively. These sets
can be used for training and evaluating the machine learning model.
In this code, a Naive Bayes classifier is trained using the training data. First,
an instance of the NaiveBayes class is created with the features column
specified as vectorized_features and the label column specified as label. This
configuration determines which columns of the dataset will be used as
features (input variables) and labels (output variables) for training the
classifier.
Next, the fit method is called on the NaiveBayes instance, passing the
training data (train_data) as an argument. This step trains the Naive Bayes
classifier using the labeled training examples. The model learns the
statistical properties of the input features and their corresponding labels. The
resulting trained model is assigned to the variable model and can be used to
make predictions on new, unseen data.
In this line of code, the trained Naive Bayes model is used to make
predictions on the test dataset. The transform method is called on the trained
model, passing the test dataset (test_data) as an argument. This step applies
the trained model to the test data and generates predictions for each example
in the test dataset.
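A sketch of these training and prediction steps, with variable names mirroring the description:

from pyspark.ml.classification import NaiveBayes

# Configure and train the Naive Bayes classifier on the training set
nb = NaiveBayes(featuresCol='vectorized_features', labelCol='label')
model = nb.fit(train_data)

# Apply the trained model to the test set to generate predictions
predictions = model.transform(test_data)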
[In]: from pyspark.sql.functions import udf
[In]: label_map = {index: name for index, name in enumerate(newsgroups_all.target_names)}
[In]: map_label = udf(lambda label: label_map[label])
[In]: predictions = predictions.withColumn("label_name", map_label(predictions.label))
[In]: predictions = predictions.withColumn("prediction_name", map_label(predictions.prediction))
[In]: selected_predictions = predictions.select("label_name", "prediction_name").orderBy(rand()).limit(5)
[In]: selected_predictions.show(truncate=False)
[Out]:
+----------------------+----------------------+
|label_name |prediction_name |
+----------------------+----------------------+
|soc.religion.christian|soc.religion.christian|
|comp.graphics |comp.graphics |
|sci.space |sci.med |
|alt.atheism |alt.atheism |
|sci.med |sci.med |
+----------------------+----------------------+
In the second step, the predictions DataFrame is being used to select five
random rows containing the actual and predicted label names. The select
method is used to specify the columns to be included in the resulting
DataFrame, which are label_name and prediction_name columns. The
orderBy method with the rand() function is used to randomize the order of
rows in the DataFrame. The limit method is used to limit the number of rows
to be selected, which in this case is set to 5. Finally, the show method is
called to display the selected rows from the DataFrame. The truncate=False
parameter ensures that the full contents of the columns are displayed without
truncation.
[In]: evaluator = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction', metricName='accuracy')
[In]: accuracy = evaluator.evaluate(predictions)
[In]: print('Accuracy:', round(accuracy, 2))
[Out]: 0.71
The preceding line is used to display the value of the accuracy variable. It
outputs the string “Accuracy:” followed by the value stored in the accuracy
variable. This value is approximately 0.71, which means that the model’s
predictions were correct for approximately 71% of the test data instances.
This value represents the proportion of correctly classified instances out of
the total number of instances in the test dataset.
Scikit-Learn
Step 2: Tokenization
tokens = nltk.word_tokenize(text)
string.punctuation]
return tokens
politics.misc', 'talk.religion.misc']
size=0.2, random_state=42)
Step 5: Create TF-IDF vectors from the text data with tokenization
tokenizer=tokenize)
Step 8: Print the top five rows of actual vs. predicted labels
actual_label = newsgroups_all.target_names[y_test[i]]
predicted_label = newsgroups_all.target_names[y_pred[i]]
print("Predicted:", predicted_label)
print("---")
PySpark
getOrCreate()
religion.misc']
shuffle=True, random_state=42)
Step 6: Tokenize the text plus remove punctuation from the text column
pattern='\\W')
outputCol='filtered_tokens')
outputCol='rawFeatures')
outputCol='vectorized_features')
Step 10: Split the dataset into train and test sets
seed=42)
target_names)}
map[label])
label(predictions.prediction))
Step 14: Select five random rows of actual vs. predicted labels
name").orderBy(rand()).limit(10)
[In]: selected_predictions.show(truncate=False)
predictionCol='prediction', metricName='accuracy')
In the next chapter, we will delve into the process of building, training, and
evaluating a k-means clustering algorithm for effective data segmentation.
Clustering is a commonly used technique in segmentation analysis, grouping
similar observations together based on their characteristics or proximity in
the feature space. The result is a set of clusters, with each observation
assigned to a specific cluster.
CHAPTER 15
k-Means Clustering with Pandas, Scikit-Learn, and PySpark
Clustering differs from supervised learning in at least two key areas. First,
the input into a clustering algorithm is unlabeled. In supervised learning,
such as regression and classification, we predict the label. However, in
cluster analysis, there is no label to predict. Instead, we group similar
observations together based on their inherent characteristics or patterns. Second, because there are no ground-truth labels, evaluating a clustering model relies on internal measures of cluster quality, such as the Silhouette Score used later in this chapter, rather than on accuracy against known labels.
The Dataset
For this chapter, we use the open source Iris dataset to implement a k-means
clustering algorithm. We have already explored this dataset in Chapter 8.
The dataset contains measurements of four features (sepal length, sepal
width, petal length, and petal width) from three different species of Iris
flowers (setosa, versicolor, and virginica).
Step 3: Access the feature data (X) and target labels (y)
[In]: X = iris.data
[In]: y = iris.target
[In]: pandas_df['target'] = y
[In]: pandas_df['target_names'] = pandas_df['target'].map(dict(zip(range(len(target_names)), target_names)))
[In]: print(pandas_df.head())
[Out]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target target_names
0                5.1               3.5                1.4               0.2       0       setosa
1                4.9               3.0                1.4               0.2       0       setosa
2                4.7               3.2                1.3               0.2       0       setosa
3                4.6               3.1                1.5               0.2       0       setosa
4                5.0               3.6                1.4               0.2       0       setosa
To get the output, we loaded the Iris dataset into a Pandas DataFrame. First,
we import the necessary libraries, including load_iris from Scikit-Learn (to
load the Iris dataset) and Pandas (for DataFrame handling). We then use the
load_iris() method to load the Iris dataset, which returns an object containing
feature data (X) and target labels (y). We define column names for the
DataFrame using the feature_names and target_names attributes. Next, we
create a Pandas DataFrame (pd.DataFrame()) with the feature data, adding
target and target_names columns using the map() method and a dictionary.
Finally, we print the first few rows of the DataFrame to inspect the loaded
data’s structure and content using the head() method.
One important feature of this dataset is that all features (sepal length, sepal
width, petal length, and petal width) are on the same scale (centimeters).
This suggests that scaling will not be necessary for applying the k-means
model to the Iris dataset.
[In]: spark_df.show(5)
[Out]:
+-----------------+----------------+-----------------+----------------+------+------------+
|sepal length (cm)|sepal width (cm)|petal length (cm)|petal width (cm)|target|target_names|
+-----------------+----------------+-----------------+----------------+------+------------+
|              5.1|             3.5|              1.4|             0.2|     0|      setosa|
|              4.9|             3.0|              1.4|             0.2|     0|      setosa|
|              4.7|             3.2|              1.3|             0.2|     0|      setosa|
|              4.6|             3.1|              1.5|             0.2|     0|      setosa|
|              5.0|             3.6|              1.4|             0.2|     0|      setosa|
+-----------------+----------------+-----------------+----------------+------+------------+
only showing top 5 rows
In the first step, we import the SparkSession class, which is necessary for
creating Spark applications. In the second step, we create a Spark object
named spark using the SparkSession.builder.getOrCreate() method. This
allows us to access the Spark functionality. In the third step, we convert
pandas_df DataFrame into spark_df DataFrame by utilizing the
createDataFrame() method. This transformation enables us to leverage the
distributed computing capabilities of Spark for large-scale data analysis.
In the final step, we display the first five rows of the newly created PySpark
DataFrame using the show() method.
[In]: print(pandas_df.shape)
[Out]: (150, 6)
We can achieve the same results with the following PySpark code:
[In]: print((spark_df.count(), len(spark_df.columns)))
[Out]: (150, 6)
In this PySpark code, we print the shape of the Spark DataFrame by utilizing
the count() method to retrieve the number of rows and the len() function to
determine the number of columns.
Let’s count the number of species within each target variable using the
Pandas value_counts() method:
[In]: species_count = pandas_df['target_names'].value_counts()
[In]: print(species_count)
[Out]:
setosa 50
versicolor 50
virginica 50
[In]: species_count = spark_df.groupBy('target_names').count()
[In]: species_count.show()
[Out]:
+------------+-----+
|target_names|count|
+------------+-----+
| setosa| 50|
| versicolor| 50|
| virginica| 50|
+------------+-----+
This PySpark code performs a groupBy operation on the spark_df using the
column target_names. By applying the groupBy() method, the data is
grouped based on unique values in the target_names column. The count()
method is then applied to the grouped DataFrame to calculate the number of
occurrences for each flower species.
We can see from the output of both Pandas and PySpark that each target has 50 observations. This suggests that an accurate k-means algorithm should group the Iris dataset into three clusters, each with 50 observations. We will see if this is the case in the next section.
Iris dataset into three clusters, each with 50 observations. We will see if this
is the case in the next section.
We know from the data analysis in the Dataset section that there are three species (setosa, versicolor, and virginica), so if our clustering algorithm is 100% accurate, it should group the data into three clusters, each having 50 observations.
[In]: iris = load_iris()
[In]: X = iris.data
This code loads the Iris dataset and assigns it to the variable iris. The dataset
is then assigned to the variable X, which will be used to store the feature
data of the dataset. The code then defines column names for the feature data
of the Iris dataset and creates a DataFrame using the Pandas library.
[In]: sse_values = []
sse_values.append(kmeans.inertia_)
The algorithm needs a specific value of k to start with. We know that the
correct value is 3 since we know that there are three classes (setosa,
versicolor, virginica), but in many real-world applications, we wouldn’t
know the number of efficient groups or clusters.
The elbow method is a widely used technique for determining the optimal
number of clusters in a dataset. It involves selecting different values of k and
training corresponding clustering models. For each model, the Sum of
Squared Errors (SSE) or sum of within-cluster distances of centroids is
computed and plotted against k.
The optimal k is determined at the point where the SSE shows a significant
decrease, resembling the shape of an elbow. This indicates that additional
clusters beyond that point do not contribute significantly to improving the
clustering quality.
The code provided previously calculates the SSE values for different values
of k using the k-means clustering algorithm. The following is what the code
does line by line:
• for k in k_values starts a loop that iterates over each value of k in the k_values range.
• kmeans = KMeans(n_clusters=k, random_state=42) creates an instance of the KMeans class with k clusters; the random_state seed ensures reproducibility of the results.
• kmeans.fit(X) fits the k-means model to the feature data (X) using the fit() method. The k-means algorithm will partition the data into k clusters.
• sse_values.append(kmeans.inertia_) appends the model's inertia, the SSE, to the sse_values list. The inertia measures how far the samples within each cluster are from the centroid of that cluster.
After executing these lines of code, the sse_values list will contain the SSE
values for each value of k specified in the k_values range. These values can
be used to generate a plot to decide on the optimum number of clusters by
identifying the elbow point.
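Putting these lines together, the elbow-method loop could be sketched as follows; the exact range of k values tested is an assumption:

from sklearn.cluster import KMeans

k_values = range(1, 11)      # candidate numbers of clusters (assumed range)
sse_values = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)                           # fit k-means with k clusters
    sse_values.append(kmeans.inertia_)      # SSE (inertia) for this k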
[In]: plt.ylabel('Inertia')
[In]: plt.show()
[Out]: (a line plot of the SSE values against the number of clusters k, showing an elbow at k=3)
Let’s first explain the code and then the plot. The code is used to plot the
SSE values against the number of clusters (k) in order to identify the elbow
point, which helps determine the optimal number of clusters for the dataset.
The following is a breakdown of what the code does:
• plt.plot(k_values, sse_values, ‘bx-’) plots the candidate k values against the SSE values (sse_values). The ‘bx-’ argument specifies the line style and marker style for the plot. In this case, ‘bx-’ represents blue (‘b’) ‘x’ markers connected by a solid line (‘-’).
• plt.ylabel(‘SSE’) sets the label for the y axis of the plot to “SSE” using the
ylabel() function.
• plt.title(‘The Elbow Method’) sets the title of the plot to “The Elbow
Method” using the title() function.
• plt.show() displays the plot on the screen using the show() function.
The plot indicates that the elbow point, at which the number of clusters is
optimal, is k=3, as this is the point located at the bend in the plot.
[In]: kmeans = KMeans(n_clusters=k, random_state=42)
[In]: kmeans.fit(X)
This code is used to build and train a k-means clustering model on the
dataset X with a number of clusters k = 3. The first line creates an instance
of the KMeans class from the Scikit-Learn library. The n_clusters parameter
is set to the desired number of clusters k, indicating how many clusters the
k-means algorithm should aim to create (in this case, 3). The random_state
parameter is set to 42 to ensure reproducibility of the results.
The second line fits the k-means model to the data X using the fit() method.
The algorithm will assign each data point to one of the k clusters based on
their similarity and iteratively adjust the cluster centroids to minimize the
within-cluster sum of squares.
After executing these lines of code, the kmeans object will contain the
trained kmeans clustering model, which will have learned the cluster
assignments and the final centroids for the specified number of clusters.
Step 6: Evaluate the clustering model
The second line calculates the Silhouette Score using the silhouette_score()
function. This takes two parameters: X, which is the feature data used for
clustering, and labels, which are the cluster labels assigned by the clustering
model to each data point.
The function returns a single value representing the average Silhouette Score
for all data points. The last line prints the score, which is approximately
0.55.
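A sketch of this evaluation code, based on the description above:

from sklearn.metrics import silhouette_score

labels = kmeans.labels_                       # cluster label assigned to each data point
silhouette = silhouette_score(X, labels)      # average Silhouette Score
print(f"Silhouette Score: {silhouette:.2f}")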
The following code calculates and prints the predicted cluster counts:
[In]: cluster_counts = {}
[In]: for label in kmeans.labels_:
          if label in cluster_counts:
              cluster_counts[label] += 1
          else:
              cluster_counts[label] = 1
[In]: print(cluster_counts)
We observe from this output that cluster 1 has 50 counts, as expected, while
the counts for the other two clusters deviate from 50, either higher (cluster 0:
62) or lower (cluster 2: 38).
[In]: plt.show()
In this code, we are using the seaborn library to create a pair plot for
visualizing relationships between pairs of features in the Iris dataset, with a
focus on how the data points are clustered according to a previously fitted k-
means clustering model (kmeans).
The first line imports the seaborn library and names it sns. The next step
adds cluster labels to the pandas_df DataFrame. This is done by assigning
the cluster labels obtained from the k-means model to a new column in the
DataFrame named Cluster.
• palette=’viridis’: This sets the color palette to the viridis color scheme.
• diag_kind=’kde’: This specifies kernel density estimate (KDE) plots along the diagonal cells, showing the distribution of each feature within each cluster.
• markers=[‘o’, ‘s’, ‘D’]: This parameter defines the markers used for data points in each cluster. Here, ‘o’, ‘s’, and ‘D’ represent different marker shapes for clarity.
• plt.suptitle(‘Pair Plot of Features with Cluster Assignments’): This line sets
the super title for the pair plot, providing a descriptive title for the
visualization.
• plt.show(): Finally, this command displays the pair plot with all the
specified visualizations.
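Based on the description above, the pair plot code could be sketched as follows; dropping the original target columns before plotting is an assumption:

import seaborn as sns
import matplotlib.pyplot as plt

# Attach the k-means cluster assignments to the Pandas DataFrame
pandas_df['Cluster'] = kmeans.labels_

# Pair plot of the four features, colored by cluster assignment
sns.pairplot(
    pandas_df.drop(columns=['target', 'target_names']),
    hue='Cluster',
    palette='viridis',
    diag_kind='kde',
    markers=['o', 's', 'D'],
)
plt.suptitle('Pair Plot of Features with Cluster Assignments')
plt.show()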
The pair plot shows that one cluster is clearly separated, while the other two overlap in the feature space; as a result, clusters 0 and 2 have 24 data points (62-38) that are misclassified due to the overlap.
It’s now time to construct a k-means clustering algorithm using the PySpark
ML library.
The following are the steps to build, train, and evaluate the algorithm using
the Iris dataset:
This code imports various libraries and modules necessary for performing k-
means clustering on the Iris dataset using PySpark:
• SparkSession provides the entry point to Spark functionality.
• VectorAssembler combines the input feature columns into a single vector column.
• KMeans implements the k-means clustering algorithm in PySpark.
• ClusteringEvaluator provides clustering evaluation metrics such as the Silhouette Score.
[In]: X = iris.data
The following code defines the input features, creates a vector column, and
transforms a Spark DataFrame.
outputCol="features")
[In]: k = 3
The first line creates a KMeans object and assigns it to the variable kmeans.
The setK() method is called on kmeans and sets the number of clusters to k.
The value of k represents the desired number of clusters in the k-means
model. The setSeed() method is called to set the random seed for
reproducibility.
The second line fits the k-means model to the spark_df DataFrame. The fit()
method is called on the kmeans object, passing spark_df as the argument.
This trains the k-means model on the provided DataFrame. The resulting
trained model is assigned to the variable model.
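These two steps could be sketched as follows; the list of input columns and the seed value are assumptions:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Combine the four measurement columns into a single 'features' vector
feature_cols = [c for c in spark_df.columns if c not in ('target', 'target_names')]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
spark_df = assembler.transform(spark_df)

# Build and train the k-means model with k = 3
k = 3
kmeans = KMeans().setK(k).setSeed(42)
model = kmeans.fit(spark_df)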
This code performs predictions using the trained k-means clustering model
on the Spark DataFrame. The transform() method is called on the trained k-
means model (model) with the spark_df DataFrame as the argument. This
applies the trained model to the spark_df DataFrame and generates
predictions based on the input data. The resulting predictions, along with the
original columns of the DataFrame, are stored in a new DataFrame called
predictions.
This code calculates and prints the Silhouette Score for the clustering results
obtained from the k-means model. The first line creates an instance of the
ClusteringEvaluator, which is a class in PySpark that provides methods to
evaluate the quality of clustering results. The created object is assigned to
the variable evaluator.
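A sketch of the prediction and evaluation steps (the print formatting is an assumption):

from pyspark.ml.evaluation import ClusteringEvaluator

predictions = model.transform(spark_df)     # adds a 'prediction' column
evaluator = ClusteringEvaluator()           # defaults to the Silhouette measure
silhouette = evaluator.evaluate(predictions)
print("Silhouette Score:", silhouette)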
Step 9: Count and print the occurrences for each predicted cluster
[In]: cluster_counts = predictions.groupBy('prediction').count()
[In]: cluster_counts.show()
[Out]:
+----------+-----+
|prediction|count|
+----------+-----+
| 0| 50|
| 1| 38|
| 2| 62|
+----------+-----+
Here, we consolidate all the relevant code from the previous steps into a
single code block. This way, the reader can execute the code as a single
block in both Scikit-Learn and PySpark.
Scikit-Learn
[In]: X = iris.data
[In]: sse_values = []
kmeans.fit(X)
sse_values.append(kmeans.inertia_)
[In]: plt.ylabel('SSE')
[In]: plt.show()
[In]: k = 3
[In]: kmeans.fit(X)
[In]: cluster_counts = {}
if label in cluster_counts:
cluster_counts[label] += 1
else:
cluster_counts[label] = 1
PySpark
Step 1: Import necessary libraries
[In]: X = iris.data
outputCol="features")
[In]: k = 3
[In]: cluster_counts.show()
CHAPTER 16
Hyperparameter Tuning with Scikit-Learn and PySpark
Examples of Hyperparameters
Model                   Scikit-Learn        PySpark               Description
Decision trees          max_depth           maxDepth              Maximum depth of the tree
                        min_samples_split   minInstancesPerNode   Minimum number of samples required to split an internal node
                        min_samples_leaf    minInfoGain           Minimum number of samples required at a leaf node
                        max_features        maxBins               Number of features (bins) considered for the best split
Random forests          n_estimators        numTrees              Number of decision trees in the random forest
                        max_depth           maxDepth              Maximum depth of each tree in the forest
                        min_samples_split   minInstancesPerNode   Minimum number of samples required to split an internal node
                        min_samples_leaf    minInfoGain           Minimum number of samples required at a leaf node
                        max_features        maxBins               Number of features (bins) considered for the best split
Gradient-boosted trees  n_estimators        numTrees              Number of trees used in the boosting process
                        learning_rate       stepSize              Contribution of each tree to the final prediction
                        max_depth           maxDepth              Maximum depth of each tree
                        min_samples_split   minInstancesPerNode   Minimum number of samples required to split an internal node
                        min_samples_leaf    minInfoGain           Minimum number of samples required at a leaf node
Neural networks         hidden_layers       numLayers             Number of hidden layers in the network
                        hidden_units        layerSizes            Number of units in each hidden layer
                        learning_rate       stepSize              Step size used during training
k-Means                 n_clusters          k                     Number of clusters used for clustering
                        max_iter            maxIterations         Maximum number of iterations
We use the same Iris dataset we used in Chapter 9 to classify the Iris flowers
into different species. By leveraging hyperparameter tuning techniques in
random forests, we aim to build a highly accurate classification model that
can effectively distinguish between the three species of Iris flowers: setosa,
versicolor, and virginica.
Hyperparameter Tuning in Scikit-Learn
We start the process by importing the necessary libraries. These are required
to provide the essential functionality and tools necessary for our analysis.
[In]: iris_sklearn = load_iris()
This line loads the Iris dataset using the load_iris() function.
[In]: X = iris_sklearn.data
[In]: y = iris_sklearn.target
This code separates the features and labels from the loaded dataset.
[In]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In this step, we split the data into training and testing sets, with 80% for
training and 20% for testing.
[In]: rf = RandomForestClassifier(random_state=42)
This line defines an instance of the random forest classifier. The random
seed ensures reproducibility of the results.
[In]: param_grid = {
In this code, we define the param_grid so that the grid search algorithm will
systematically evaluate various combinations of these hyperparameters to
find the best combination that results in the highest model performance.
Given the relatively small size of the Iris dataset (150 records), we have
made conservative parameter choices. This is to prevent overfitting and to
ensure that the model does not become overly complex for the given dataset
size.
For reference, the default parameters for the Scikit-Learn random forest
classifier are as follows:
• n_estimators: 100
• min_samples_split: 2
• min_samples_leaf: 1
• 'max_depth': [2, 5, 7, 10] defines the candidate values for max_depth, the maximum depth of each tree. The grid search will consider four values for max_depth: 2, 5, 7, and 10. The model will be trained and evaluated with each of these values to determine the best max_depth value.
• 'n_estimators': [20, 50, 75, 100] defines the n_estimators hyperparameter, the number of trees in the random forest. The grid search will explore four values for n_estimators.
• 'min_samples_split' defines the minimum number of samples required to split an internal node in a decision tree. The grid search will evaluate the model with four different values for min_samples_split.
• 'min_samples_leaf' defines the minimum number of samples required at a leaf node in each tree of the forest. The grid search will assess the model's performance using four values for min_samples_leaf, as sketched below.
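Based on the description above, the parameter grid could be sketched as follows; the candidate values for min_samples_split and min_samples_leaf are assumptions chosen to include the best values reported later in the chapter:

param_grid = {
    'max_depth': [2, 5, 7, 10],
    'n_estimators': [20, 50, 75, 100],
    'min_samples_split': [2, 3, 4, 5],   # assumed candidate values
    'min_samples_leaf': [1, 2, 3, 4],    # assumed candidate values
}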
[In]: grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)
[In]: grid_search.fit(X_train, y_train)
In these two lines of code, the grid search is performed using the
GridSearchCV class from the Scikit-Learn library. Here is what the code
does:
• The first line creates an instance of GridSearchCV and initializes it with
the following three parameters:
• estimator: This specifies the machine learning estimator that will be used. In this case, it is set to rf, which is an instance of the RandomForestClassifier.
• param_grid: This specifies the hyperparameters and their possible values that will be explored during the grid search, as defined in the param_grid dictionary.
• cv=5: This specifies fivefold cross-validation. The training data will be split into five folds, and the grid search will perform training and validation five times, each time holding out a different fold. This makes efficient use of the data while still having multiple validation sets for robust evaluation.
• The second line of code initiates the grid search process. The fit() method
of the GridSearchCV object is called with the training data
X_train and the corresponding labels y_train. The grid search algorithm then
performs the search, training and evaluating the model with every combination of hyperparameters defined in param_grid.
Step 9: Make predictions on the test data using the best model
In this line of code, we use the best model to predict the labels for the test
data.
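A sketch of this step together with the accuracy evaluation that produces the output below (variable names are assumptions):

from sklearn.metrics import accuracy_score

best_model = grid_search.best_estimator_     # best estimator found by the grid search
y_pred = best_model.predict(X_test)          # predict labels for the test data

accuracy = accuracy_score(y_test, y_pred)    # compare predictions with the true labels
print(accuracy)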
[Out]: 1.0
This step evaluates the model’s performance using the accuracy metric. The
first line calculates the accuracy of the model by comparing the predicted
labels with the true labels. The second line prints the accuracy of the model
on the test data.
[Out]:
Best max_depth: 5
Best n_estimators: 50
Best min_samples_split: 3
Best min_samples_leaf: 1
In this final step, after obtaining the best model and its corresponding
hyperparameters, the code prints the values of these hyperparameters.
• Test accuracy (1.0): The best model achieved an accuracy of 1.0 on the test dataset, meaning that 100% of the samples were correctly predicted.
• Best max_depth (5): The optimal value for the maximum depth of each tree in the forest is 5.
• Best n_estimators (50): The optimal number of trees in the forest is 50.
• Best min_samples_split (3): The optimal value for the minimum number of samples required to split an internal node in the random forest classifier is 3.
• Best min_samples_leaf (1): The optimal value for the minimum number of samples required at a leaf node is 1.
[In]: from pyspark.ml.evaluation import MulticlassClassificationEvaluator
In this first step, we import the necessary libraries. Here is what each import
does:
• SparkSession provides the entry point to Spark functionality, and VectorAssembler assembles the input features into a single vector column.
• RandomForestClassifier implements the random forest classifier in PySpark.
• ParamGridBuilder and CrossValidator are used to define the hyperparameter grid and to perform cross-validation.
• MulticlassClassificationEvaluator provides the accuracy metric for evaluating the model.
• load_iris loads the Iris dataset, and Pandas is used to build an intermediate DataFrame.
[In]: iris_data = load_iris()
This code simply loads the Iris dataset using the load_iris() function.
[In]: X = iris_data.data
[In]: y = iris_data.target
columns=iris_data.feature_names)
[In]: iris_pandas["label"] = y
This code creates a Pandas DataFrame named iris_pandas from the Iris
dataset, which has been loaded into the variables X (feature data) and y
(target labels) in step 3.
More precisely, the first line creates a new DataFrame called iris_pandas
using the Pandas DataFrame constructor. It takes two arguments: data and
columns. The data argument (X) is the feature data array, and the columns
argument specifies the column names for the DataFrame, which are
extracted from the feature_names attribute of the iris_data object.
The second line adds a new column named label to the iris_pandas
DataFrame and assigns it the values from the y array (which represents the
target labels of the Iris dataset).
Step 7: Split the data into features and labels and assemble the features
outputCol="features")
The last line applies the transform method of the assembler object to the
PySpark iris_df DataFrame. The method takes the input DataFrame and adds
a new column named features to it. This new column contains the combined
vector representation of the input features specified by inputCols.
This code splits the data in the iris_df into training and testing sets. The
randomSplit function takes two arguments. The first argument, [0.8, 0.2],
represents the proportions for splitting the data: 80% of the data will be
allocated to the training set, and 20% will be allocated to the test set. The
second argument, seed=42, sets the random seed to ensure reproducibility of
the splits.
The result is a list containing two PySpark DataFrames: X_train and X_test.
The X_train DataFrame contains 80% of the data, randomly sampled, which
will be used for training the classifier. The X_test DataFrame contains the
remaining 20% of the data, which will be used for evaluating the
performance of the trained model on unseen data.
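A sketch of the assembling and splitting steps described above; using the Iris feature names as the input columns is an assumption:

from pyspark.ml.feature import VectorAssembler

# Assemble the four feature columns into a single 'features' vector
assembler = VectorAssembler(inputCols=list(iris_data.feature_names), outputCol="features")
iris_df = assembler.transform(iris_df)

# Split the data into training (80%) and test (20%) sets
X_train, X_test = iris_df.randomSplit([0.8, 0.2], seed=42)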
Step 9: Define the random forest classifier
[In]: rf = RandomForestClassifier(featuresCol="features",
labelCol="label")
.build()
This code defines a parameter grid for tuning the hyperparameters of the
random forest classifier. By defining this grid, the code sets up a range of
values for each hyperparameter that will be tested during the hyperparameter
tuning process. The grid builder allows for systematically evaluating
different combinations of hyperparameters to find the optimal configuration
for the random forest classifier.
For reference, the default hyperparameters for the PySpark random forest
classifier are as follows:
• maxDepth: 5
• numTrees: 20
• minInstancesPerNode: 1
• minInfoGain: 0.0
Now, let’s explain the parameter grid we’ve set up to customize the default
values:
• maxDepth: [2, 5, 7, 10] mirrors the Scikit-Learn grid so that the two searches explore comparable hyperparameter combinations.
• numTrees: the grid specifies the values [20, 50, 75, 100] to be tested for the numTrees parameter.
• minInstancesPerNode and minInfoGain control how splits happen in the decision trees. A higher value implies stricter splitting criteria, and a lower value allows for more splits, potentially leading to a more complex model. Notice how PySpark uses decimals for minInfoGain, so these parameters are not exact equivalents of their Scikit-Learn counterparts across the two platforms, which may impact accuracy and the best parameter combinations (see the sketch below).
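A sketch of this parameter grid; the maxDepth values mirror the Scikit-Learn grid, and the candidate values for minInstancesPerNode and minInfoGain are assumptions that include the best values reported later:

from pyspark.ml.tuning import ParamGridBuilder

param_grid = (ParamGridBuilder()
              .addGrid(rf.maxDepth, [2, 5, 7, 10])
              .addGrid(rf.numTrees, [20, 50, 75, 100])
              .addGrid(rf.minInstancesPerNode, [1, 3, 5])    # assumed candidate values
              .addGrid(rf.minInfoGain, [0.0, 0.01, 0.1])     # assumed candidate values
              .build())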
[In]: evaluator =
MulticlassClassificationEvaluator(metricName="accuracy")
[In]: cv = CrossValidator(estimator=rf, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)
[In]: cv_model = cv.fit(X_train)
The CrossValidator is initialized with the following parameters: the estimator (rf), the parameter grid (estimatorParamMaps=param_grid), the evaluator, and the number of cross-validation folds (numFolds=5). Calling fit() on the training data runs the cross-validation and returns the fitted model, which is assigned to cv_model.
• The first line assigns the best model found during the cross-validation process to the variable best_model. The cv_model object contains the results of the cross-validation, and its bestModel attribute holds the best-performing random forest.
• The remaining lines extract the winning hyperparameter values from best_model. A higher minInfoGain value implies a stricter criterion for splitting, while a lower value allows for more splits, potentially leading to a more complex model.
Step 14: Make predictions on the test data using the best model
The code applies the transform method to the best_model object, passing in
the test data (X_test). This method applies the trained model to the test data
and generates predictions, which are stored in the predictions object.
This code calculates the accuracy of the predictions made by the model on
the test data. This is done by comparing the predicted labels to the true
labels in the test data.
The resulting accuracy value is assigned to the variable accuracy, which can
be used to assess the performance of the model on the test data.
Step 16: Print the test accuracy and best parameter values
[Out]:
Best maxDepth: 5
Best numTrees: 50
Best minInstancesPerNode: 5
In this step, the code prints out various metrics and hyperparameters related
to the trained model and its performance on the test data. This provides an
overview of the model’s performance on the test data and highlights the
optimal settings for the hyperparameters that were tuned during the training
process.
• Test accuracy: This measures how well the model's predictions align with the true labels in the test data.
• Best maxDepth (5): This reveals the best value found for the maximum depth of each tree in the forest.
• Best numTrees (50): This indicates the best number of trees chosen for the random forest classifier. The value 50 indicates that the optimal forest contains 50 trees.
• Best minInstancesPerNode (5): This indicates the best value for the minimum number of instances per node in the decision trees. The value 5 indicates that the optimal setting for the minimum instances per node is 5, meaning that a node will only be split further if it contains at least 5 instances.
• Best minInfoGain (0.0): This indicates the best value for the minimum information gain required to make a split.
Finally, it's worth noting that the results from PySpark differ slightly from those of Scikit-Learn. Both algorithms agreed on the Best maxDepth and Best numTrees, which were 5 and 50, respectively.
Let's now consolidate all the code snippets from the previous steps into a single code block. This will allow the reader to execute the code as a cohesive unit in both
Scikit-Learn and PySpark.
Scikit-Learn
[In]: X = iris_sklearn.data
[In]: y = iris_sklearn.target
size=0.2, random_state=42)
[In]: param_grid = {
cv=5)
Step 9: Make predictions on the test data using the best model
PySpark
MulticlassClassificationEvaluator
[In]: X = iris_data.data
[In]: y = iris_data.target
columns=iris_data.feature_names)
[In]: iris_pandas["label"] = y
Step 7: Split the data into features and labels and assemble the features
outputCol="features")
[In]: rf = RandomForestClassifier(featuresCol="features",
labelCol="label")
.build()
[In]: evaluator =
MulticlassClassificationEvaluator(metricName="accuracy")
Step 14: Make predictions on the test data using the best model
Summary
In this chapter, we investigated an important topic in machine learning:
hyperparameter tuning. This is a critical step in model building that
involves finding the optimal set of hyperparameters for a given algorithm.
We also provided examples of such parameters in both Scikit-Learn and
PySpark and delved into a detailed example of how to fine-tune key
hyperparameters in random forests using the Iris dataset.
CHAPTER 17
Pipelines with Scikit-Learn and PySpark
In the final part of the chapter, we shift our focus to PySpark, allowing data
scientists planning to migrate from Scikit-Learn to reap the benefits of a
powerful distributed data processing framework. PySpark offers a scalable
and efficient solution for working with large-scale datasets. We illustrate
how to build a regression pipeline using PySpark’s MLlib library. Our
pipeline incorporates the VectorAssembler to assemble input features, the
StandardScaler for feature scaling, and the LinearRegression class for
regression modeling. We explore the process of fitting the pipeline to the
training data, making predictions on test data, and evaluating the model’s
performance.
By the end of this chapter, the reader will have a solid understanding of how
to leverage pipeline techniques to streamline their machine learning
workflow in these two widely used frameworks.
The following code imports several functions and classes necessary for the
pipeline process:
• train_test_split splits the dataset into training and testing subsets. It randomly divides the data into two portions: one for training the regression model and the other for evaluating it.
• StandardScaler standardizes the features so that each has a mean of 0 and a standard deviation of 1.
• Pipeline and LinearRegression allow the scaling step and the regression model to be chained into a single estimator, as sketched below.
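These imports could be sketched as follows, based on the functions and classes used in the rest of the chapter:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline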
[In]: np.random.seed(0)
[In]: num_samples = 1000
[In]: X = np.random.rand(num_samples, 2) * 10
[In]: y = np.random.rand(num_samples) * 20 - 10
In this step, we generate some random data to illustrate the pipeline process.
The first line sets the seed value for the random number generator in
NumPy. By setting a specific seed, such as 0, we ensure that the same
sequence of random numbers will be generated each time we run the code.
The second line assigns the value 1000 to the variable num_samples. This
represents the number of samples (data points) we want to generate.
[In]: print(X[0:5])
[Out]:
[[5.48813504 7.15189366]
[6.02763376 5.44883183]
[4.23654799 6.45894113]
[4.37587211 8.91773001]
[9.63662761 3.83441519]]
[In]: print(y[0:5])
In the code that generates the preceding output, the X[0:5] line slices the X
variable to retrieve the rows from index 0 to index 4 (inclusive), which
correspond to the first five rows of X. The resulting rows are then printed,
displaying the top five rows of X. Similarly, the y[0:5] slices the y variable
to retrieve the elements from index 0 to index 4 (inclusive), which
correspond to the first five elements of y. The resulting elements are then
printed, displaying the top five elements of y.
The following code uses the train_test_split function to split the dataset into
training and testing subsets:
[In]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
The test_size=0.2 parameter specifies that 20% of the data will be reserved for testing, while the remaining 80% will be used for training the machine learning model. The random_state=0 parameter ensures that the split is reproducible. The variables X_train and y_train will contain the training data, while X_test and y_test will contain the testing data.
This is an important step in the pipeline process. In the following code, two
stages of the machine learning pipeline are defined (we could define any
number of stages):
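A minimal sketch of these two stages, using the scaler and regression names referenced below (default constructor arguments are assumed), is:
[In]: scaler = StandardScaler()
[In]: regression = LinearRegression()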
The first line creates an instance of the StandardScaler class and assigns it
to the variable scaler. This is a data preprocessing step used to standardize
numerical features.
It scales the features by subtracting the mean and dividing by the standard
deviation, resulting in transformed features with zero mean and unit
variance.
In a typical pipeline, as shown in the next step, these stages are combined
together using the Pipeline class, where each stage is defined as a tuple
consisting of a name (optional) and the corresponding object.
[In]: pipeline = Pipeline([('scaler', scaler), ('regression',
regression)])
This line of code creates an instance of the Pipeline class and assigns it to
the variable pipeline. This class allows us to define a sequence of data
processing steps and an estimator as a single unit.
1) ('scaler', scaler): This defines the name scaler for the step and assigns the scaler object to it. This corresponds to the StandardScaler preprocessing stage.
2) ('regression', regression): This defines the name regression for the step and assigns the regression object to it. This corresponds to the LinearRegression estimator, the final stage of the pipeline.
It is important to note that the order of the tuples in the list determines the
order in which the steps will be executed.
In this step, we fit the previously defined pipeline to the training data:
[In]: pipeline.fit(X_train, y_train)
This line of code calls the fit method on the pipeline object, which fits the
pipeline to the training data. The X_train and y_train variables represent the
training data used for fitting the pipeline. X_train contains the input
features, and y_train contains the corresponding target variable or labels.
When we call fit on the pipeline, it applies the StandardScaler to scale the
features and then fits the LinearRegression model to the transformed data.
This step extracts the coefficients and the intercept from the fitted pipeline’s
regression model:
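A minimal sketch of this step, assuming the coefficients and intercept variable names, is:
[In]: coefficients = pipeline.named_steps['regression'].coef_
[In]: intercept = pipeline.named_steps['regression'].intercept_
[In]: print("Coefficients:", coefficients)
[In]: print("Intercept:", round(intercept, 4))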
[Out]:
Intercept: 0.0938
The first line retrieves the coefficients of the regression model that is part of
the pipeline. The named_steps attribute of the pipeline allows accessing
individual steps by their specific names. In this case, regression is the name
assigned to the LinearRegression step. By using .coef_ on the
LinearRegression object, we obtain the weights or parameters of the model.
The second line retrieves the intercept (or bias) term of the regression
model. Similar to the previous line, it accesses the LinearRegression step
using the name regression and retrieves the intercept value using
.intercept_.
The last two statements print the coefficients and the intercept. The
coefficients represent the weights assigned to each feature in the linear
regression model, indicating the contribution of each feature to the
predicted outcome. The intercept term represents the baseline prediction
when all features have a value of zero.
The following code performs predictions on the test data using the fitted
pipeline:
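A one-line sketch of this step, assuming the y_pred name used below, is:
[In]: y_pred = pipeline.predict(X_test)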
This line uses the predict method of the fitted pipeline object to make
predictions on the test data X_test. This method applies the transformations
defined in the pipeline’s steps to the input data and then uses the trained
model to generate predictions. The X_test variable represents the test data,
which contains the input features on which we want to make predictions.
In this step, we want to take a look at the top five actual values and
corresponding predicted values for the target variable:
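A minimal sketch of preparing the comparison data (the rounding used in the printed output is assumed) is:
[In]: import pandas as pd
[In]: data = {"Actual": y_test[0:5], "Predicted": y_pred[0:5]}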
[In]: df = pd.DataFrame(data)
[In]: print(df.head())
[Out]:
   Actual  Predicted
0    7.59       0.03
1    8.62      -0.07
2    6.43       0.02
3    7.07       0.12
4   -1.04       0.19
The first line of the code imports Pandas as pd. In the second line, a Python
dictionary called data is created, with the keys representing the names of the
columns we want to create in the DataFrame. The values are obtained by
slicing the first five elements from y_test and y_pred, which contain the
actual and predicted values from the machine learning model. In the next
line of code, the Pandas DataFrame df is created by passing the data
dictionary as an argument to the DataFrame() constructor. Finally, the
head() method displays the first few rows of the DataFrame.
We can see from the output that there is a big difference between the actual
and predicted values, indicating that the performance of the model isn’t that
great. This is fine for the task at hand since our simulated data is used
purely for illustration purposes.
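A minimal sketch of the RMSE computation described next, assuming mean_squared_error has been imported from sklearn.metrics, is:
[In]: rmse = np.sqrt(mean_squared_error(y_test, y_pred))
[In]: print("Root Mean Squared Error (RMSE) on test data:", rmse)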
The first line of the code calculates the RMSE by using the
mean_squared_error function. This takes two arguments: the actual target
values (y_test) and the predicted target values (y_pred). It computes the
mean squared error between the two sets of values. The np.sqrt function is
then applied to the mean squared error to obtain the RMSE. This is a
commonly used metric to measure the average difference between the
predicted values and the actual values. It represents the square root of the
average of the squared differences. The resulting RMSE value is assigned to
the variable rmse.
The next line prints the computed RMSE on the test data, which is
approximately 5.55. This value provides a measure of the model’s accuracy
or goodness of fit. A lower RMSE indicates better predictive performance,
as it represents a smaller average difference between the predicted and
actual values.
The first step in the pipeline process is to import the necessary classes from
the PySpark library:
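A plausible set of imports for this PySpark pipeline (the exact list is assumed here) is:
[In]: from pyspark.sql import SparkSession
[In]: from pyspark.ml.feature import VectorAssembler, StandardScaler
[In]: from pyspark.ml.regression import LinearRegression
[In]: from pyspark.ml import Pipeline
[In]: from pyspark.ml.evaluation import RegressionEvaluator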
These include SparkSession for creating a Spark session and working with data in a structured way, VectorAssembler for combining the input feature columns into a single vector column, StandardScaler for scaling the assembled features so that they have zero mean and unit variance, LinearRegression for the regression model, Pipeline for chaining the stages together, and RegressionEvaluator for measuring the model's performance.
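A one-line sketch of creating the session (any appName or other builder options are assumed) is:
[In]: spark = SparkSession.builder.getOrCreate()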
This line of code creates a Spark Session to enable us to interact with Spark.
"id as id",
"(RAND() * 5) as feature2",
"(RAND() * 20 - 10) as label")
[Out]:
id    feature1    feature2    label
0     3.51        4.01        -2.85
1     1.97        4.49         2.36
2     8.14        3.70        -6.81
3     7.40        2.95         0.93
4     3.64        1.12        -3.44
In the preceding code, we first assign the value 1000 to the variable
num_samples, which represents the number of rows or samples that we
want to generate for our simulated dataset. We then create a DataFrame
called simulated_data using the spark.range() function. This generates a
DataFrame with a single column named id containing numbers from 0 to
num_samples-1. This acts as an index column for the simulated data.
The selectExpr() method in the second line is used to select and transform
columns in the DataFrame. It takes multiple arguments, where each
argument specifies a transformation expression to be applied to a column:
• "(RAND() * 10) as feature1" generates a random value between 0 and 10 for each row and assigns it to a new column named feature1.
• "(RAND() * 5) as feature2" generates a random value between 0 and 5 for each row and assigns it to a new column named feature2. Similar to the previous expression, it uses the RAND() function and simply scales it to 0–5.
• "(RAND() * 20 - 10) as label" generates a random value between -10 and 10 for each row and assigns it to a new column named label.
The resulting DataFrame therefore contains the columns id, feature1 (random values between 0 and 10), feature2 (random values between 0 and 5), and label (random values between -10 and 10).
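A minimal sketch of the split described next, assuming an 80/20 ratio and a fixed seed purely for illustration, is:
[In]: train_data, test_data = simulated_data.randomSplit([0.8, 0.2], seed=42)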
The last line performs the actual split of the data into training and testing
sets. The randomSplit() function takes two arguments: an array of ratios for
the training and testing sets and a seed parameter to ensure that the random
splitting of the data is reproducible.
In this step, the pipeline stages for the machine learning workflow are
defined using the PySpark ML library:
[In]: assembler = VectorAssembler(inputCols=["feature1", "feature2"],
        outputCol="features")
[In]: scaler = StandardScaler(
        inputCol="features", outputCol="scaledFeatures")
[In]: regression = LinearRegression(
        featuresCol="scaledFeatures", labelCol="label")
In this code, the actual pipeline is created using the Pipeline class from
PySpark. It is no coincidence that PySpark utilizes the same name and
concept as Scikit-Learn, as it draws inspiration from Scikit-Learn’s pipeline
implementation.
• The assembler stage is responsible for assembling the input features into a vector. It takes the columns feature1 and feature2 as input and combines them into a single vector column named features.
• The scaler stage standardizes the feature vector column obtained from the previous stage. It takes the features column as input and produces a scaled version named scaledFeatures.
• The regression stage fits a linear regression model, using scaledFeatures as the input features and label as the target column.
The pipeline_stages list in the code contains the stages in the order in which
they will be executed. The Pipeline object is created with pipeline_stages as
the input, representing the entire pipeline.
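A one-line sketch of this step, assuming the pipeline_model name used below, is:
[In]: pipeline_model = pipeline.fit(train_data)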
In this single line of code, the pipeline is fitted to the training data using the
fit() method. Fitting the pipeline means executing each stage of the pipeline
in order, applying the defined transformations, and training the specified
model on the training data.
The fit() method takes the training data as input and applies each stage’s
transformation sequentially, starting from the first stage (assembler) and
ending with the last stage (regression). The output of each stage is passed as
input to the next stage in the pipeline. During the fitting process, the
transformations defined in the pipeline stages are applied to the training
data, and the model specified in the last stage (regression) is trained using
the transformed data.
The result of calling fit() is a fitted pipeline model (pipeline_model) that
encapsulates the trained model along with the applied transformations. This
fitted pipeline model can be used to make predictions on new data or
evaluate the model’s performance.
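A minimal sketch of retrieving the coefficients and intercept, assuming the regression_model variable name, is:
[In]: regression_model = pipeline_model.stages[-1]
[In]: print("Coefficients:", regression_model.coefficients)
[In]: print("Intercept:", regression_model.intercept)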
In this code, we retrieve the coefficients and intercept of the trained linear
regression model from the fitted pipeline model (pipeline_model). Since the
linear regression model is the last stage in the pipeline, we access it using
pipeline_model.stages[-1]. The stages attribute of the pipeline model
contains a list of all stages in the pipeline, and -1
refers to the last stage. Since we have used two features (feature1 and
feature2) in the linear regression model, we have two resulting coefficients,
with approximate values of 0.22 and -0.29, respectively. The intercept
(approximately 0.13) is produced regardless of the number of features.
These results differ from those of the Scikit-Learn regression model,
suggesting that there are implementation differences between the two
platforms.
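A one-line sketch of generating the predictions DataFrame described next is:
[In]: predictions = pipeline_model.transform(test_data)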
The output DataFrame predictions contains the original columns from the
test data, along with additional columns such as scaledFeatures and
prediction. The prediction column contains the predicted values generated
by the trained linear regression model for each corresponding row in the
test data.
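A minimal sketch of displaying the first five labels and predictions is:
[In]: predictions.select("label", "prediction").show(5)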
[Out]:
label    prediction
 8.46        -0.22
-9.10        -0.16
-7.05        -0.14
 7.65        -0.15
 5.21        -0.04
In this step, we would like to take a look at the top five rows of actual labels
and corresponding predicted values from the predictions DataFrame. The
select method is used to select the label and prediction columns. The show()
method is then used to retrieve the first five rows of the selected columns.
As the output shows, there are large differences between the two sets of
values, indicating that the model is making prediction errors. This is not
surprising given that we are using a simulated dataset, which is fine as our
purpose is to illustrate the concept of pipelining.
predictionCol="prediction", metricName="rmse")
In the next line, we use the evaluate method of the evaluator object to
calculate the RMSE. We pass the predictions DataFrame, which contains
the actual labels and predicted values, as the argument to the evaluate
method. The RMSE value is then assigned to the variable rmse.
It is worth noting that the RMSE value from PySpark (5.88) differs from the value obtained with Scikit-Learn (5.55). This is expected: the two pipelines were fitted on different simulated datasets, one generated with NumPy and the other with Spark's RAND() function, so the two values are not directly comparable. In addition, PySpark and Scikit-Learn may use different optimization techniques for their regression models, which can lead to variations in model predictions and RMSE values.
Bringing It All Together
Now, let's consolidate all the relevant code from the preceding steps into a single code block. This will enable the reader to execute the code as a coherent entity in both Scikit-Learn and PySpark.
Scikit-Learn
[In]: np.random.seed(0)
[In]: num_samples = 1000
[In]: X = np.random.rand(num_samples, 2) * 10
[In]: y = np.random.rand(num_samples) * 20 - 10
[In]: X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=0)
[In]: scaler = StandardScaler()
[In]: regression = LinearRegression()
[In]: pipeline = Pipeline([('scaler', scaler), ('regression',
regression)])
[In]: pipeline.fit(X_train, y_train)
[In]: df = pd.DataFrame(data)
[In]: print(df.head())
"id as id",
"(RAND() * 5) as feature2",
inputCol="features", outputCol="scaledFeatures")
featuresCol="scaledFeatures", labelCol="label")
predictionCol="prediction", metricName="rmse")
[In]: print("Root Mean Squared Error (RMSE) on test data:", rmse) Step 11:
Display top five actual vs. predicted
In the next chapter, which marks the final chapter of this book, we will
conclude by providing an in-depth exploration of the practical aspects of
deploying machine learning models in production using both Scikit-Learn
and PySpark.
CHAPTER 18
Deploying Models in Production with Scikit-Learn and PySpark
Model deployment is a critical step in the machine learning life cycle, and
it’s important for several reasons. First, a machine learning model provides
value only when used to make predictions or automate tasks in a real-world
setting. Deployment ensures that the model is put to practical use, which
can lead to cost savings and efficiency improvements. Moreover,
deployment allows an organization to scale the use of its machine learning
model to serve a large number of users or handle a high volume of data. In a
production environment, the model can process data in real-time or batch
mode, making it suitable for various applications.
However, the reader should be aware that there are other ways to deploy
models in production, even though these methods are out of scope in this
chapter. One common approach is to deploy the algorithm as an API
service, allowing other applications and systems to make HTTP requests to
the model for predictions. Frameworks like Flask, Django, and FastAPI are
often used to build API endpoints. Another option is containerization.
Docker containers are used to package machine learning models along with
their dependencies into portable units. Kubernetes and other container
orchestration tools enable scaling and management of containerized models.
To begin, you train and evaluate your model using the provided algorithms,
such as MLPClassifier in Scikit-Learn and MultilayerPerceptronClassifier
in PySpark for neural networks. This involves preprocessing the data using
available modules, splitting it into training and testing sets, and employing
evaluation metrics like accuracy score to assess model performance.
To persist the model in Scikit-Learn, using the joblib library's dump() function, you can save the model as a file, ready to be loaded later
for deployment. Similarly, PySpark offers the ML persistence API, allowing
you to save both the model configuration and the trained weights using the
save() method.
The final step involves utilizing the saved models. In Scikit-Learn, you can
load the saved model file into memory using the joblib library and the
joblib.load() function. This loads the model, making it ready for making
predictions on new data.
In PySpark, the saved model can be loaded back using the ML persistence API. You can then use the loaded model to transform new data
by calling the transform() method.
The following functions and classes allow us to access and utilize the
necessary functionality from the Scikit-Learn library to train, evaluate,
save, and deploy the MLP
classifier:
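A plausible import block covering the items described below (the exact list is assumed here) is:
[In]: import numpy as np
[In]: import joblib
[In]: from sklearn.datasets import load_iris
[In]: from sklearn.model_selection import train_test_split
[In]: from sklearn.neural_network import MLPClassifier
[In]: from sklearn.metrics import accuracy_score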
• numpy: This library will be used to generate some data for new
predictions.
• MLPClassifier: This class implements a Multilayer Perceptron neural network for classification tasks.
• train_test_split: This is a utility function that will split our dataset into training and testing subsets for the purpose of evaluating the model.
• joblib: This module provides tools for saving and loading Python objects, which we use to persist the trained model to disk.
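A one-line sketch of loading the dataset, assuming the iris variable name used below, is:
[In]: iris = load_iris()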
[In]: X = iris.data
[In]: y = iris.target
In this step, we load the Iris dataset using the Scikit-Learn built-in
load_iris() function. The features are assigned to the variable X while
corresponding target values are assigned to the variable y.
Step 3: Split the data into train and test sets
[In]: X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
This line of code uses the train_test_split function to split the data into
training and testing sets. X and y are the feature and label sets, respectively,
that we want to split.
The test_size=0.2 parameter specifies that 20% of the data will be used for
the test set, while 80% will be used for the training set. The
random_state=42 parameter ensures reproducibility. By using the same seed
value, we will get the same data split each time we run the code. The
training and testing sets are assigned to variables X_train, X_test, y_train,
and y_test, respectively.
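A minimal sketch of creating and training the classifier (the hyperparameters shown, such as max_iter, are assumptions for illustration; the second line is the training call discussed next) is:
[In]: mlp = MLPClassifier(max_iter=1000, random_state=42)
[In]: mlp.fit(X_train, y_train)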
This line of code is used to train the MLP classifier on the training data.
The fit() method trains the model using the provided input features
(X_train) and the corresponding target labels (y_train).
During the training process, the MLP classifier learns the relationships
between the input features and the target labels. It iteratively adjusts the
weights and biases of the neural network based on the training data, aiming
to minimize the error or loss function.
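A one-line sketch of this prediction step, assuming the y_pred name used below, is:
[In]: y_pred = mlp.predict(X_test)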
This line is used to make predictions on the test data using the trained MLP
classifier model. The predict() method applies the trained model to the input
data and returns the predicted class labels. For each sample in the test data,
it predicts the corresponding target label based on the learned patterns and
associations from the training data. The resulting predicted labels are stored
in the y_pred variable.
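A minimal sketch of the accuracy evaluation described next is:
[In]: accuracy = accuracy_score(y_test, y_pred)
[In]: print("Accuracy:", accuracy)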
This code is used to evaluate the accuracy of the predictions made by the MLP classifier.
The output indicates that the accuracy score for the MLP classifier is 1.0,
which means that 100% of all predictions were correct. This suggests that
the model has accurately classified each species of the Iris flower.
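A one-line sketch of saving the model as described next is:
[In]: joblib.dump(mlp, "mlp_model.pkl")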
This line of code uses the joblib library to save the machine learning model
to a file named mlp_model.pkl using the dump() function. The mlp
parameter represents the machine learning model object. The
mlp_model.pkl parameter is the name of the file where the serialized model
will be saved. The .pkl extension is commonly used for files that store
serialized Python objects using the pickle protocol.
This step involves utilizing the saved model. We can load the saved model
file into memory using the joblib library and the joblib.load() function. This
loads the model, making it ready for making predictions on new data:
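A one-line sketch of this loading step is:
[In]: mlp = joblib.load("mlp_model.pkl")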
This line of code uses the load() function from the joblib library to load the
serialized MLP model from the mlp_model.pkl file and assign it to the
variable mlp. After executing this line of code, mlp will contain the
deserialized model object, which can be used for prediction.
[In]: np.random.seed(42)
[In]: X_new = np.random.rand(10, 4)
This code generates new data using NumPy’s random module. The purpose
is to use this new data to demonstrate how the loaded model can be utilized.
The first line of code sets the random seed for the random number
generator. Setting the seed to 42 ensures that the random numbers generated
are reproducible.
The second line creates a NumPy array (X_new) that contains random
numbers. The rand() function generates random numbers from a uniform
distribution between 0 and 1.
The arguments (10, 4) specify the shape of the array to be created, where 10
represents the number of rows and 4 represents the number of columns.
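A minimal sketch of predicting on the new data, assuming the y_new_pred name used below, is:
[In]: y_new_pred = mlp.predict(X_new)
[In]: print("Predictions:", y_new_pred)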
[Out]: Predictions: [1 1 1 0 1 1 0 1 1 1]
In this final step, we first use the predict() method of the mlp model to
make predictions on the new dataset X_new. The predict() method takes the
input data and returns the predicted values based on the trained model. The
predictions are stored in the variable y_new_pred. We then print the
predictions generated by the model using the print() function.
PySpark
We now switch focus to PySpark to show how to train and evaluate an MLP
classifier using Spark ML, persist the model, and then load it to make
predictions on new data.
[In]: from pyspark.ml.classification import (
          MultilayerPerceptronClassificationModel,
          MultilayerPerceptronClassifier)
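The remaining imports referenced in this section plausibly include the following (the exact list is assumed here):
[In]: import numpy as np
[In]: from pyspark.sql import SparkSession
[In]: from pyspark.ml.feature import VectorAssembler
[In]: from pyspark.ml.evaluation import MulticlassClassificationEvaluator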
This code imports necessary classes for building, training, evaluating, and
deploying an MLP classification model in PySpark. Here is a description of
each of the imported classes:
• SparkSession: the entry point for creating DataFrames and running Spark operations.
• MultilayerPerceptronClassifier: the estimator used to define and train the MLP network.
• MultilayerPerceptronClassificationModel: the class of the trained (and later reloaded) model.
• VectorAssembler: assembles the individual feature columns into a single vector column.
• MulticlassClassificationEvaluator: computes evaluation metrics such as accuracy and related information.
[In]: train_data, test_data = spark_df.randomSplit([0.8, 0.2],
seed=42)
This line of code splits the Spark DataFrame (spark_df) into two separate
DataFrames, namely, train_data and test_data, based on a specified split
ratio. The randomSplit() method provided by Spark DataFrame does this
split randomly. It takes two arguments:
1) The [0.8, 0.2] argument is a list specifying the split ratios for the train and test data, respectively. In this case, 80% of the data will be assigned to train_data and the remaining 20% to test_data.
2) The seed=42 argument controls the randomness: as long as the same seed is used, the random split will produce the same train-test partition every time the code is run.
"petal_width"]
outputCol="features")
The last two commands use the transform method of the VectorAssembler
to apply the transformation to the training and testing data. This
transformation takes the specified feature columns from each row in the
DataFrames (train_data and test_data) and combines them into a single
vector column called features. This transformation is essential for machine
learning algorithms in PySpark, as they expect input features to be in vector
format.
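A minimal sketch of defining the classifier (the layers and maxIter values are assumptions for illustration; the label column is species) is:
[In]: mlp = MultilayerPerceptronClassifier(featuresCol="features",
        labelCol="species", layers=[4, 10, 3], maxIter=100, seed=42)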
The seed=42 parameter sets the random seed for the MLP classifier. It
ensures reproducibility of results by initializing the random number
generator to a specific state.
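A one-line sketch of the training step described next is:
[In]: mlp_model = mlp.fit(train_data)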
In this single line of code, the MLP classifier (mlp) is being trained on the
training data (train_data). The fit method is called to initiate the training
process and create a trained model. The mlp is the Multilayer Perceptron
classifier instance that was defined earlier. It encapsulates the architecture
and settings for the MLP network.
The fit(train_data) is the method call that starts the training process. It takes
the train_data DataFrame as input, which contains the labeled training
examples with their corresponding features and labels. The fit method
performs an iterative optimization process, where it updates the weights and
biases of the MLP network using the training examples. The number of
iterations is determined by the maxIter parameter specified when creating
the MLP classifier.
After the training process completes, the fit method returns a trained MLP
model (mlp_model). This model represents the learned patterns and
relationships in the training data and can be used to make predictions on
new, unseen data.
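A minimal sketch of the prediction and evaluation steps described next is:
[In]: predictions = mlp_model.transform(test_data)
[In]: evaluator = MulticlassClassificationEvaluator(labelCol="species",
        predictionCol="prediction", metricName="accuracy")
[In]: accuracy = evaluator.evaluate(predictions)
[In]: print("Accuracy:", accuracy)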
In this step, the trained model generates predictions on the test data, and we then measure their accuracy. The accuracy variable holds the result of
evaluating the accuracy metric on the predictions. The evaluate method is
called on the evaluator instance, passing in the predictions DataFrame. The
evaluate method compares the predicted labels in the prediction column of
the predictions DataFrame with the actual labels in the species column. It
calculates the accuracy, which is the fraction of correctly predicted labels
out of the total number of instances in the test data.
Finally, the accuracy value is printed using the print statement. This value is
1.0
or 100%, indicating that the model has made accurate predictions for each
of the Iris species. This means that the model’s predictions match the true
labels perfectly, resulting in a classification accuracy of 100%, which is a
strong indication of the model’s performance on the test data.
[In]: mlp_model.write().overwrite().save(model_dir)
The save() method saves the trained model to the specified directory path (model_dir). The trained model is serialized and stored in the specified location for
future use.
[In]: loaded_model = MultilayerPerceptronClassificationModel.load(model_dir)
In this code, a saved Multilayer Perceptron model is loaded for further use.
The load() method is called on the
MultilayerPerceptronClassificationModel class, and the loaded model is
assigned to the variable loaded_model.
The model_dir variable contains the directory path where the trained model
was previously saved. The loaded_model variable is assigned the loaded
Multilayer Perceptron model, which can now be used for making
predictions on new data.
[In]: np.random.seed(42)
[In]: X_new = np.random.rand(10, 4)
The first line of the code sets the random seed for the NumPy random
number generator. Setting a seed ensures that the random numbers
generated will be the same every time the code is run. The second line
generates a new NumPy array named X_new with dimensions 10×4. The
rand function is used to generate random numbers between 0 and 1. Each
element in the array represents a random feature value. The first argument,
10, represents the number of rows or samples in the array. The second argument, 4, represents the number of columns or features.
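A minimal sketch of converting the new data into a Spark DataFrame (the new_df name is an assumption) is:
[In]: new_df = spark.createDataFrame(X_new.tolist(), ["sepal_length",
        "sepal_width", "petal_length", "petal_width"])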
The tolist() method converts the NumPy array X_new into a nested Python
list. This conversion is necessary because createDataFrame expects an input
that is compatible with Python data structures. The ["sepal_length", "sepal_width", "petal_length", "petal_width"] argument supplies the column names for the new Spark DataFrame.
Step 14: Assemble features into a single vector column for new data
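A minimal sketch of steps 14 and 15, assuming the new_df and new_data names, is:
[In]: new_data = assembler.transform(new_df)
[In]: new_predictions = loaded_model.transform(new_data)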
[In]: new_predictions.select("prediction").show()
[Out]:
+----------+
|prediction|
+----------+
| 1.0|
| 1.0|
| 1.0|
| 0.0|
| 0.0|
| 1.0|
| 0.0|
| 1.0|
| 1.0|
| 1.0|
+----------+
In step 15, predictions are made on the new dataset using the loaded
Multilayer Perceptron model (loaded_model). The resulting predictions are
then displayed. More details are provided in the following text.
The first line of code applies the trained Multilayer Perceptron model (loaded_model) to the new data using its transform() method, producing the new_predictions DataFrame.
The second line selects the prediction column from the new_predictions
DataFrame and displays its contents. The select() method is called on
new_predictions with the column name prediction as the argument. The
show() method is then called to display the selected column.
Bringing It All Together
In this section, we combine all the relevant code from the preceding steps into a single code block. This will enable the reader to execute the code as a coherent entity in both Scikit-Learn and PySpark.

Scikit-Learn
[In]: X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
# Evaluate accuracy
[In]: np.random.seed(42)
PySpark
[In]: from pyspark.ml.classification import (
          MultilayerPerceptronClassificationModel,
          MultilayerPerceptronClassifier)
# Create a SparkSession
[In]: X = iris.data
[In]: y = iris.target
"petal_width"]
outputCol="features")
# Evaluate accuracy
[In]: mlp_model.write().overwrite().save(model_dir)
[In]: loaded_model = MultilayerPerceptronClassificationModel.load(model_dir)
[In]: np.random.seed(42)
[In]: new_predictions.select("prediction").show()
Summary
Document Outline
Table of Contents
About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: An Easy Transition
PySpark and Pandas Integration
Similarity in Syntax
Loading Data
Selecting Columns
Aggregating Data
Filtering Data
Joining Data
Saving Data
Modeling Steps
Pipelines
Summary
Chapter 2: Selecting Algorithms
The Dataset
Selecting Algorithms with Cross-Validation
Scikit-Learn
PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 3: Multiple Linear Regression with Pandas, Scikit-Learn, and
PySpark
The Dataset
Multiple Linear Regression
Multiple Linear Regression with Scikit-Learn
Multiple Linear Regression with PySpark
Summary
Chapter 4: Decision Tree Regression with Pandas, Scikit-Learn, and
PySpark
The Dataset
Decision Tree Regression
Decision Tree Regression with Scikit-Learn
The Modeling Steps
Decision Tree Regression with PySpark
The Modeling Steps
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 5: Random Forest Regression with Pandas, Scikit-Learn, and
PySpark
The Dataset
Random Forest Regression
Random Forest with Scikit-Learn
Random Forest with PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 6: Gradient-Boosted Tree Regression with Pandas, Scikit-
Learn, and PySpark
The Dataset
Gradient-Boosted Tree (GBT) Regression
GBT with Scikit-Learn
GBT with PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 7: Logistic Regression with Pandas, Scikit-Learn, and
PySpark
The Dataset
Logistic Regression
Logistic Regression with Scikit-Learn
Logistic Regression with PySpark
Putting It All Together
Scikit-Learn
PySpark
Summary
Chapter 8: Decision Tree Classification with Pandas, Scikit-Learn, and
PySpark
The Dataset
Decision Tree Classification
Scikit-Learn and PySpark Similarities
Decision Tree Classification with Scikit-Learn
Decision Tree Classification with PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 9: Random Forest Classification with Scikit- Learn and
PySpark
Random Forest Classification
Scikit-Learn and PySpark Similarities for Random Forests
Random Forests with Scikit-Learn
Random Forests with PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 10: Support Vector Machine Classification with Pandas,
Scikit-Learn, and PySpark
The Dataset
Support Vector Machine Classification
Linear SVM with Scikit-Learn
Linear SVM with PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 11: Naive Bayes Classification with Pandas, Scikit-Learn, and
PySpark
The Dataset
Naive Bayes Classification
Naive Bayes with Scikit-Learn
Naive Bayes with PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 12: Neural Network Classification with Pandas, Scikit-Learn,
and PySpark
The Dataset
MLP Classification
MLP Classification with Scikit-Learn
MLP Classification with PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 13: Recommender Systems with Pandas, Surprise, and
PySpark
The Dataset
Building a Recommender System
Recommender System with Surprise
Recommender System with PySpark
Bringing It All Together
Surprise
PySpark
Summary
Chapter 14: Natural Language Processing with Pandas, Scikit-Learn,
and PySpark
The Dataset
Cleaning, Tokenization, and Vectorization
Naive Bayes Classification
Naive Bayes with Scikit-Learn
Naive Bayes with PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 15: k-Means Clustering with Pandas, Scikit-Learn, and
PySpark
The Dataset
Machine Learning with k-Means
k-Means Clustering with Scikit-Learn
k-Means Clustering with PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 16: Hyperparameter Tuning with Scikit-Learn and PySpark
Examples of Hyperparameters
Tuning the Parameters of a Random Forest
Hyperparameter Tuning in Scikit-Learn
Hyperparameter Tuning in PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 17: Pipelines with Scikit- Learn and PySpark
The Significance of Pipelines
Pipelines with Scikit-Learn
Pipelines with PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Chapter 18: Deploying Models in Production with Scikit- Learn and
PySpark
Steps in Model Deployment
Deploying a Multilayer Perceptron (MLP)
Deployment with Scikit-Learn
PySpark
Bringing It All Together
Scikit-Learn
PySpark
Summary
Index