Intro To Analytics and ML With Sparklyr
ML with Sparklyr
Lotfi NAJDI
Connecting to Spark
We can connect to both local instances of Spark as well as remote Spark clusters.
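For example, opening a local connection might look like this (a minimal sketch; the master URL for a remote cluster depends on your environment):

library(sparklyr)
sc <- spark_connect(master = "local")  # e.g. master = "yarn-client" for a remote cluster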
Using dplyr verbs
copy_to
We'll start by using copy_to to upload a local (R) data frame into a remote data source (the Spark cluster):

copy_to(sc, nycflights13::flights, "nycflights13_spark")
Using dplyr verbs
src_tbls
We can use src_tbls to list all the tables provided by the remote source:
src_tbls(sc)
# [1] "nycflights13_spark"
Using dplyr verbs
tbl
Now that we’ve copied the data, we can use tbl() to take a reference to it:
nycflights13_tbl <- tbl(sc, "nycflights13_spark")
nycflights13_tbl
# # Source: spark<nycflights13_spark> [?? x 19]
# year month day dep_time sched_dep_time dep_delay
# <int> <int> <int> <int> <int> <dbl>
# 1 2013 1 1 517 515 2
# 2 2013 1 1 533 529 4
# 3 2013 1 1 542 540 2
# 4 2013 1 1 544 545 -1
# 5 2013 1 1 554 600 -6
# 6 2013 1 1 554 558 -4
# 7 2013 1 1 555 600 -5
# 8 2013 1 1 557 600 -3
# 9 2013 1 1 557 600 -3
# 10 2013 1 1 558 600 -2
# # ... with more rows, and 13 more variables:
# #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, ...
Using dplyr verbs
We can also create a Spark data frame and take a reference to it at the same time:
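A minimal sketch of that single step, reusing the same data: copy_to returns a tbl_spark, so the reference comes back directly (the overwrite argument is only needed if the table already exists):

nycflights13_tbl <- copy_to(sc, nycflights13::flights, "nycflights13_spark", overwrite = TRUE)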
Your turn 1
Use the spark_connect() function to connect to a local instance of Spark.
Upload a local (R) data frame into a remote data source using copy_to.

src_tbls(sc)
Using dplyr verbs to analyse data

Let's try to analyse data on flight delays from nycflights13_spark:

tailnum_delay <- nycflights13_tbl %>%
  group_by(tailnum) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE), n = n()) %>%  # summarise step assumed
  arrange(desc(delay))
The sequence of operations is not executed until we ask for the data (e.g., by printing tailnum_delay); only then does dplyr generate the SQL and send the query to Spark.
Even then, it tries to do as little work as possible and only pulls down a few rows.
Using dplyr verbs to analyse data
tailnum_delay
# # Source: spark<?> [?? x 3]
# # Ordered by: desc(delay)
# tailnum delay n
# <chr> <dbl> <dbl>
# 1 N14228 3.71 111
# 2 N24211 7.7 130
# 3 N793JB 4.72 283
# 4 N657JB 5.03 285
# 5 N730MQ 1.2 178
# 6 N9EAMQ 9.24 248
# 7 N705TW -7.09 293
# 8 N318NB -1.12 202
# 9 N627VA 7.53 120
# 10 N646JB 5.00 260
# # ... with more rows
Using dplyr verbs
show_query()
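The chunk for this slide was not captured; a minimal example of inspecting the SQL that dplyr generates for the lazy query built above:

tailnum_delay %>% show_query()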
Using dplyr verbs
collect()
Typically, you'll iterate a few times before you figure out what data you need from the database.
Once you've figured it out, use collect() to pull all the data down into a local tibble.
For example, to count the number of flights by carrier and month in the Spark data frame nycflights13_spark:

nycflights13_tbl %>%
  count(carrier, month) %>%
  arrange(-n)
Then use collect() to pull all the data down into a local tibble:

tailnum_delay <- nycflights13_tbl %>%
  group_by(tailnum) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE), n = n()) %>%  # summarise step assumed
  arrange(desc(delay)) %>%
  collect()
tailnum_delay
Reading Data
Example of parquet file
Source: Apache Software Foundation, Apache Parquet (http://parquet.apache.org)
Reading Data
Example of parquet file
spark_read_parquet reads a Parquet file and provides a data source compatible with dplyr.
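The read itself is not shown in the extracted slides; a sketch where the file path is hypothetical and the table name matches the flights15_spark entry listed by src_tbls() later:

flights15_tbl <- spark_read_parquet(sc, name = "flights15_spark",
                                    path = "data/flights_2015.parquet")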
src_tbls(sc)
Explore Data
Use a dplyr verb to count the total number of flights in this dataset.
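A minimal way to do that on the Spark side, assuming the flights15_tbl reference from the previous step:

flights15_tbl %>% count()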
Explore Data
We can see that there are a lot of NA values in the ARR_DELAY column.
flights15_tbl %>%
filter(is.na(ARR_DELAY)) %>%
count()
# # Source: spark<?> [?? x 1]
# n
# <dbl>
# 1 25293
Explore Data
flights15_tbl %>%
filter(CANCELLED == 1 )
# # Source: spark<?> [?? x 16]
# YEAR MONTH DAY_OF_MONTH DAY_OF_WEEK CARRIER FL_NUM
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 2015 1 27 2 AA 1
# 2 2015 1 28 3 AA 1
# 3 2015 1 26 1 AA 2
# 4 2015 1 27 2 AA 2
# 5 2015 1 27 2 AA 3
# 6 2015 1 28 3 AA 3
# 7 2015 1 26 1 AA 4
# 8 2015 1 27 2 AA 4
# 9 2015 1 1 4 AA 6
# 10 2015 1 3 6 AA 6
# # ... with more rows, and 10 more variables: ORIGIN <chr>,
# # DEST <chr>, DEP_TIME <dbl>, DEP_DELAY <dbl>,
# # ARR_TIME <dbl>, ...
We should also create a new column called DEP_HOUR, which will hold the hour extracted from the DEP_TIME column.
Your turn 3
Use the appropriate function to read the parquet file for 2015 flights
flights15_tbl %>%
filter(is.na(ARR_DELAY)) %>%
count()
Data preparation
flights15_prepared_tbl <- flights15_tbl %>%
  filter(!is.na(ARR_DELAY)) %>%
  filter(CANCELLED == 0) %>%
  mutate(DEP_HOUR = floor(DEP_TIME / 100)) %>%
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK)
  )
Your turn 4
Complete the following chunk in order to:
Create a new column called DEP_HOUR that holds the hour extracted from the DEP_TIME column.

flights15_prepared_tbl <- flights15_tbl %>%
  filter(!is.na(ARR_DELAY)) %>%
  filter(CANCELLED == 0) %>%
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK)
  )
Data preparation
This time we will use ft_binarizer() to identify "delayed" flights.
Spark provides feature transformers that facilitate many common transformations of data within a Spark DataFrame, and sparklyr exposes these within the ft_* family of functions.
That means all flights that arrive at least 15 minutes late are considered to be delayed.
Let's also introduce a new variable called STATUS with two values, "DELAYED" and "ONTIME".
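The chunk itself was not captured in this extraction; a sketch of what it might look like (the 15-minute threshold and the ARR_DELAY/DELAYED columns match the slides and the later pipeline print-out, while the STATUS recoding is an assumption):

flights15_prepared_tbl <- flights15_prepared_tbl %>%
  ft_binarizer(input_col = "ARR_DELAY", output_col = "DELAYED", threshold = 15) %>%
  mutate(STATUS = ifelse(DELAYED == 1, "DELAYED", "ONTIME"))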
Spark Caching
sdf_register
Registers a Spark DataFrame (giving it a table name for the Spark SQL
context), and returns a tbl_spark.
sdf_register(flights15_prepared_tbl, "flights15_prepared_spark")
# # Source: spark<flights15_prepared_spark> [?? x 19]
# YEAR MONTH DAY_OF_MONTH DAY_OF_WEEK CARRIER FL_NUM
# <dbl> <chr> <dbl> <chr> <chr> <dbl>
# 1 2015 1 1 4 AA 1
# 2 2015 1 2 5 AA 1
# 3 2015 1 3 6 AA 1
# 4 2015 1 4 7 AA 1
# 5 2015 1 5 1 AA 1
# 6 2015 1 6 2 AA 1
# 7 2015 1 7 3 AA 1
# 8 2015 1 8 4 AA 1
# 9 2015 1 9 5 AA 1
Spark Caching
tbl_cache
tbl_cache forces a Spark table with the given name to be loaded into memory.
The tbl_cache command loads the results into a Spark RDD in memory, so any analysis from there on will not need to re-read and re-transform the original data.
The cached data is also smaller than the original file, because the transformations created a smaller data set than the original file.
#tbl_uncache(sc, "flights_data")
tbl_cache(sc, "flights15_prepared_spark")
Spark Caching
tbl_cache
src_tbls(sc)
# [1] "flights15_prepared_spark" "flights15_spark"
# [3] "nycflights13_spark"
Your turn 6
Using the sdf_register and tbl_cache commands, load the results from flights15_prepared_tbl into a Spark RDD in memory, so any analysis from there on will not need to re-read and re-transform the original data.
tbl_cache: forces a Spark table with the given name to be loaded into memory.

sdf_register(flights15_prepared_tbl, "flights15_prepared_spark")
visualizing the data
Before creating a model, let's start by visualizing the data.
visualizing the data
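The chunk that builds flights_delay_perc is missing from this extraction; a minimal sketch, assuming it summarises the share of flights per STATUS from the cached table registered above:

flights15_prepared_cached <- tbl(sc, "flights15_prepared_spark")  # name used later in the deck
flights_delay_perc <- flights15_prepared_cached %>%
  count(STATUS) %>%
  mutate(Percentage = round(100 * n / sum(n), 1)) %>%
  collect()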
flights_delay_perc %>%
  ggplot(aes(x = STATUS, y = Percentage)) +
  geom_col(fill = "#2150B5") +
  geom_text(aes(y = Percentage / 2, label = paste0(Percentage, "%")), colour = "white") +  # colour value truncated; "white" assumed
  coord_flip()
visualizing the data
Let us explore the effect DAY_OF_WEEK has on delay.

library(ggrepel)
flights_delay_day %>%
  ggplot(aes(x = DAY_OF_WEEK, y = Percentage, fill = STATUS)) +
  geom_col(position = "stack") +
  geom_path(aes(color = STATUS)) +
  geom_text_repel(aes(label = Percentage), size = 3) +
  ggtitle("Percentage of Flights Delayed") +
  labs(title = "Percentage of Flights Delayed", x = "Day of Week", y = "Percentage")
Your turn 7
To explore the effect of DAY_OF_WEEK on delay, compute flights_delay_day:

group_by(DAY_OF_WEEK) %>%
collect()
flights_delay_day
visualizing the data
Now we will look at the effect of the destination (DEST) on delay.
flights_delay_dest
# # Source: spark<?> [?? x 4]
# # Groups: DEST
# DEST STATUS n Percentage
# <chr> <chr> <dbl> <dbl>
# 1 ABE ONTIME 249 86.5
# 2 ABE DELAYED 39 13.5
# 3 ABQ ONTIME 2614 82.8
# 4 ABQ DELAYED 545 17.2
# 5 AGS ONTIME 532 82.0
# 6 AGS DELAYED 117 18.0
# 7 ALB DELAYED 357 18.3
# 8 ALB ONTIME 1593 81.7
# 9 ANC ONTIME 2141 75.8
# 10 ANC DELAYED 682 24.2
# # ... with more rows
visualizing the data
flights_delay_dest %>%
filter(STATUS == "DELAYED" ) %>%
arrange(desc(Percentage)) %>%
head(10) %>%
collect() %>%
mutate(DEST = fct_reorder(DEST, Percentage)) %>%
ggplot(aes(x = DEST, y= Percentage)) +
geom_col(fill = "#2150B5" ) +
geom_text(aes(y = Percentage/2, label = paste0(Percentage,"%")), colour = "white") +  # colour value truncated; "white" assumed
#facet_wrap( STATUS) +
coord_flip()
visualizing the data
Now we will look at the effect of the origin airport (ORIGIN) on delay.
flights_delay_origin
# # Source: spark<?> [?? x 4]
# # Groups: ORIGIN
# ORIGIN STATUS n Percentage
# <chr> <chr> <dbl> <dbl>
# 1 ABE ONTIME 265 91.7
# 2 ABE DELAYED 24 8.3
# 3 ABQ ONTIME 2720 86.3
# 4 ABQ DELAYED 433 13.7
# 5 AGS ONTIME 562 86.3
# 6 AGS DELAYED 89 13.7
# 7 ALB DELAYED 246 12.7
# 8 ALB ONTIME 1695 87.3
# 9 ANC ONTIME 2289 81.3
# 10 ANC DELAYED 525 18.7
# # ... with more rows
visualizing the data
flights_delay_origin %>%
filter(STATUS == "DELAYED" ) %>%
arrange(desc(Percentage)) %>%
head(10) %>%
collect() %>%
mutate(ORIGIN = fct_reorder(ORIGIN, Percentage)) %>%
ggplot(aes(x = ORIGIN, y= Percentage)) +
geom_col(fill = "#2150B5" ) +
geom_text(aes(y = Percentage/2, label = paste0(Percentage,"%")), colour = "white") +  # colour value truncated; "white" assumed
coord_flip()
Analytic workflow with sparklyr
An analytic workflow with sparklyr might be composed of the following stages:

1. Perform SQL queries through the sparklyr dplyr interface.
2. Use the sdf_* and ft_* family of functions to generate new columns, or partition your data set.
3. Choose an appropriate machine learning algorithm from the ml_* family of functions to model your data.
4. Inspect the quality of your model fit, and use it to make predictions with new data.
5. Collect the results for visualization and further analysis in R.
Data partitioning
sdf_random_split() can be used to partition the data into training and testing sets.
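A minimal sketch (the 80/20 proportions and the seed are assumptions; partitioned_flights is the name used in the later slides):

partitioned_flights <- sdf_random_split(
  flights15_prepared_cached,
  training = 0.8,
  testing  = 0.2,
  seed     = 42
)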
Fit the model to data
ml_* family of functions
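The fitting code is not in the extracted slides; a sketch consistent with the coefficients shown below, using a logistic regression on the training partition:

flights_model <- partitioned_flights$training %>%
  ml_logistic_regression(DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR)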
summary(flights_model)
# Coefficients:
# (Intercept) DEP_DELAY DISTANCE DEP_HOUR
# -2.96957935415 0.12868564313 -0.00007834523 -0.00301209314
Run predictions in Spark
Quick review of running predictions and reviewing accuracy
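The code was not captured on this slide; a sketch of scoring the testing partition with the model fitted above:

predictions <- ml_predict(flights_model, partitioned_flights$testing)
predictions %>% select(DELAYED, prediction) %>% head()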
Your turn 8
Partition the flights15_prepared_cached dataframe using sdf_random_split().
Fit a logistic regression model to the training partition.
Use ml_predict to make predictions on the testing set.
Execute the last chunk to compute the AUC.

predictions
Spark ML Pipelines
-For R users, the insights gathered during the interactive sessions with Spark can now be converted into formal, reusable ML Pipelines.
-This makes the hand-off from Data Scientists to Big Data Engineers a lot easier.
Spark ML Pipelines
-The final list of selected variables, data manipulation steps, feature transformations and modeling can be captured in an ML Pipeline,
which means that the code can be added to scheduled Spark ML jobs and run without any dependencies in R.
Spark ML Pipelines
FEATURE TRANSFORMERS
-Pipelines make heavy use of Feature Transformers.
-These functions use the Spark API directly to transform the data, and may be faster at making the data manipulations than a dplyr (SQL) transformation.
-In sparklyr, the ft_* functions are essentially wrappers around the original Spark feature transformers.
Spark ML Pipelines
FEATURE TRANSFORMERS
-We will start with dplyr transformations, which are ultimately SQL transformations, loaded into the df variable.
-In sparklyr, there is one feature transformer that is not available in Spark: ft_dplyr_transformer().
-This transformer starts by extracting the dplyr transformations from the tbl object as a SQL statement, then passes it to ft_sql_transformer().
-The goal of this function is to convert the dplyr code to a SQL Feature Transformer that can then be used in a Pipeline.
Spark ML Pipelines
FEATURE TRANSFORMERS
ft_dplyr_transformer

df <- flights15_tbl %>%
  filter(!is.na(ARR_DELAY)) %>%
  mutate(DEP_HOUR = floor(DEP_TIME / 100)) %>%  # DEP_HOUR derivation assumed, as in the earlier data preparation
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK)
  ) %>%
  select(ARR_DELAY, DEP_DELAY, MONTH, DAY_OF_WEEK, DISTANCE, DEP_HOUR, DEST, ORIGIN)  # column list truncated after DEP_HOUR; DEST and ORIGIN assumed
Spark ML Pipelines
FEATURE TRANSFORMERS
FT_DPLYR_TRANSFORMER
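The call itself was not captured on this slide; converting the dplyr code above into a SQL feature transformer would look like:

ft_dplyr_transformer(sc, df)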
Spark ML Pipelines
flights_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "ARR_DELAY", output_col = "DELAYED", threshold = 15) %>%
  ft_r_formula(DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR + ORIGIN) %>%  # formula truncated in the print-out after "ORIGIN +"
  ml_logistic_regression()
Spark ML Pipelines
flights_pipeline
# Pipeline (Estimator) with 4 stages
# <pipeline__62ac7cd1_0549_42ea_9752_59a1c3c356a6>
# Stages
# |--1 SQLTransformer (Transformer)
# | <dplyr_transformer__98bbbc43_cc8c_4fa6_801b_8618acdcc5a8>
# | (Parameters -- Column Names)
# |--2 Binarizer (Transformer)
# | <binarizer__ff84263c_96c3_49e5_b000_3e1b555b1945>
# | (Parameters -- Column Names)
# | input_col: ARR_DELAY
# | output_col: DELAYED
# |--3 RFormula (Estimator)
# | <r_formula__fd48b47b_e719_4bf7_a38d_eb3811f8f7c3>
# | (Parameters -- Column Names)
# | features_col: features
# | label_col: label
# | (Parameters)
# | force_index_label: FALSE
# | formula: DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR + ORIGIN +
# | handle_invalid: error
# | stringIndexerOrderType: frequencyDesc
# |--4 LogisticRegression (Estimator)
Your turn 9
Recreate the code to process the data using dplyr transformations and assign it to a new variable called df.

filter(!is.na(ARR_DELAY)) %>%
ml_logistic_regression()
flights_pipeline
Data partitioning
sdf_random_split() is used to partition the data into training and testing sets.
Fit the ML Pipeline
The ml_fit() function produces the Pipeline Model.
The training partition of the partitioned_flights data is used to train the model:
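The fitting call was not captured on this slide; a sketch, with the partition name assumed from the earlier split:

fitted_pipeline <- ml_fit(flights_pipeline, partitioned_flights$training)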
Fit the ML Pipeline
Notice that the print-out for the fitted pipeline now displays the model’s
coefficients.
fitted_pipeline
# PipelineModel (Transformer) with 4 stages
# <pipeline__f2c230f2_69dd_4839_a372_54ced3c35148>
# Stages
# |--1 SQLTransformer (Transformer)
# | <dplyr_transformer__6b2f03bc_ec74_4988_906d_ed5b91b10b65>
# | (Parameters -- Column Names)
# |--2 Binarizer (Transformer)
# | <binarizer__9c8f1ce4_ab4b_4609_9a42_70e2a2842b77>
# | (Parameters -- Column Names)
# | input_col: ARR_DELAY
# | output_col: DELAYED
# |--3 RFormulaModel (Transformer)
# | <r_formula__2388ea95_1517_4dbf_b273_d5f52e7b652d>
# | (Parameters -- Column Names)
# | features_col: features
# | label_col: label
# | (Transformer Info)
# | formula: chr "DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR + OR
# |--4 LogisticRegressionModel (Transformer)
Make predictions using the fitted Pipeline
The ml_predict() function can be used to run predictions with the fitted pipeline.
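A sketch of that step (testing partition name assumed as before):

predictions <- ml_predict(fitted_pipeline, partitioned_flights$testing)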
Evaluate model performance
predictions %>%
group_by(DELAYED, prediction) %>%
tally()
# # Source: spark<?> [?? x 3]
# # Groups: DELAYED
# DELAYED prediction n
# <dbl> <dbl> <dbl>
# 1 0 1 5195
# 2 0 0 284697
# 3 1 1 41794
# 4 1 0 18691
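An AUC can also be computed with sparklyr's evaluator; since the pipeline's RFormula stage produces the default label and rawPrediction columns, a minimal call is:

ml_binary_classification_evaluator(predictions, metric_name = "areaUnderROC")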
Your turn 10
Fit the flights_pipeline pipeline to the training data using the ml_fit() function.
Use the fitted model to perform predictions on the testing data using the ml_predict() function.
predictions
# # Source: spark<?> [?? x 14]
# ARR_DELAY DEP_DELAY MONTH DAY_OF_WEEK DISTANCE DEP_HOUR
# <dbl> <dbl> <chr> <chr> <dbl> <dbl>
# 1 -17 -6 1 4 2475 12
# 2 195 178 1 4 3711 15
# 3 109 108 1 4 3784 19
# 4 6 -11 1 4 2475 6
Save the fitted pipeline
The ml_save() command can be used to save the Pipeline and PipelineModel
to disk.
The resulting output is a folder with the selected name, which contains all of
the necessary Scala scripts:
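The save calls themselves are missing from this extraction; a sketch, where the folder names are assumptions (the "flights_model" name is referenced on the next slide):

ml_save(flights_pipeline, "flights_pipeline", overwrite = TRUE)
ml_save(fitted_pipeline, "flights_model", overwrite = TRUE)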
Load the fitted pipeline
The ml_load() command can be used to re-load Pipelines and
PipelineModels.
The saved ML Pipeline files can only be loaded into an open Spark session.
A simple query can be used as the table that will be used to make the new
predictions.
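A sketch of reloading the fitted model and scoring a simple query (the saved folder name and the source table are assumptions):

reloaded_model <- ml_load(sc, "flights_model")
new_flights <- tbl(sc, "flights15_spark") %>% head(100)  # any simple query can serve as the prediction table
ml_transform(reloaded_model, new_flights)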
This, of course, does not have to be done in R: at this point "flights_model" can be loaded and scored from any Spark session, for example inside a scheduled Spark job, without any dependency on R.
Disconnect from the cluster
After you are done processing data, you should terminate the connection to the cluster:

spark_disconnect(sc)