Intro To Analytics and ML With Sparklyr
ML with Sparklyr
Lotfi NAJDI
Connecting to Spark
We can connect to both local instances of Spark as well as remote Spark clusters.
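For example, opening a local connection might look like this (a minimal sketch; the master URL for a remote cluster depends on your environment):

library(sparklyr)
sc <- spark_connect(master = "local")  # e.g. master = "yarn-client" for a remote cluster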
Using dplyr verbs
copy_to
We'll start by using copy_to to upload a local (R) data frame into a remote data source (the Spark cluster):

copy_to(sc, nycflights13::flights, "nycflights13_spark")
Using dplyr verbs
src_tbls
We can use src_tbls to list all the tables provided by the remote source:
src_tbls(sc)
# [1] "nycflights13_spark"
Using dplyr verbs
tbl
Now that we’ve copied the data, we can use tbl() to take a reference to it:
nycflights13_tbl <- tbl(sc, "nycflights13_spark")
nycflights13_tbl
# # Source: spark<nycflights13_spark> [?? x 19]
# year month day dep_time sched_dep_time dep_delay
# <int> <int> <int> <int> <int> <dbl>
# 1 2013 1 1 517 515 2
# 2 2013 1 1 533 529 4
# 3 2013 1 1 542 540 2
# 4 2013 1 1 544 545 -1
# 5 2013 1 1 554 600 -6
# 6 2013 1 1 554 558 -4
# 7 2013 1 1 555 600 -5
# 8 2013 1 1 557 600 -3
# 9 2013 1 1 557 600 -3
# 10 2013 1 1 558 600 -2
# # ... with more rows, and 13 more variables:
# #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, ...
Using dplyr verbs
We can also create a Spark data frame and take a reference to it at the same time:
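A minimal sketch of that single step, reusing the same data: copy_to returns a tbl_spark, so the reference comes back directly (the overwrite argument is only needed if the table already exists):

nycflights13_tbl <- copy_to(sc, nycflights13::flights, "nycflights13_spark", overwrite = TRUE)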
Your turn 1
Use the spark_connect() function to connect to a local instance of Spark.
Upload a local (R) data frame into a remote data source using copy_to.

src_tbls(sc)
Using dplyr verbs to analyse data

Let's try to analyse data on flight delays from nycflights13_spark:

tailnum_delay <- nycflights13_tbl %>%
  group_by(tailnum) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE), n = n()) %>%  # summarise step assumed
  arrange(desc(delay))
The sequence of operations is not executed until we ask for the data (e.g., by printing tailnum_delay); only then does dplyr generate the SQL and send the query to Spark.
Even then, it tries to do as little work as possible and only pulls down a few rows.
Using dplyr verbs to analyse data
tailnum_delay
# # Source: spark<?> [?? x 3]
# # Ordered by: desc(delay)
# tailnum delay n
# <chr> <dbl> <dbl>
# 1 N14228 3.71 111
# 2 N24211 7.7 130
# 3 N793JB 4.72 283
# 4 N657JB 5.03 285
# 5 N730MQ 1.2 178
# 6 N9EAMQ 9.24 248
# 7 N705TW -7.09 293
# 8 N318NB -1.12 202
# 9 N627VA 7.53 120
# 10 N646JB 5.00 260
# # ... with more rows
Using dplyr verbs
show_query()
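The chunk for this slide was not captured; a minimal example of inspecting the SQL that dplyr generates for the lazy query built above:

tailnum_delay %>% show_query()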
Using dplyr verbs
collect()
Typically, you'll iterate a few times before you figure out what data you need from the database.
Once you've figured it out, use collect() to pull all the data down into a local tibble.
For example, to count the number of flights by carrier and month in the Spark data frame nycflights13_spark:

nycflights13_tbl %>%
  count(carrier, month) %>%
  arrange(-n)
Then use collect() to pull all the data down into a local tibble:

tailnum_delay <- nycflights13_tbl %>%
  group_by(tailnum) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE), n = n()) %>%  # summarise step assumed
  arrange(desc(delay)) %>%
  collect()
tailnum_delay
Reading Data
Example of parquet file
Source: Apache Software Foundation, Apache Parquet (http://parquet.apache.org)
Reading Data
Example of parquet file
spark_read_parquet reads a Parquet file and provides a data source compatible with dplyr.
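The read itself is not shown in the extracted slides; a sketch where the file path is hypothetical and the table name matches the flights15_spark entry listed by src_tbls() later:

flights15_tbl <- spark_read_parquet(sc, name = "flights15_spark",
                                    path = "data/flights_2015.parquet")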
src_tbls(sc)
Explore Data
Use a dplyr verb to count the total number of flights in this dataset.
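A minimal way to do that on the Spark side, assuming the flights15_tbl reference from the previous step:

flights15_tbl %>% count()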
Explore Data
We can see that there are a lot of NA values in the ARR_DELAY column.
flights15_tbl %>%
filter(is.na(ARR_DELAY)) %>%
count()
# # Source: spark<?> [?? x 1]
# n
# <dbl>
# 1 25293
Explore Data
flights15_tbl %>%
filter(CANCELLED == 1 )
# # Source: spark<?> [?? x 16]
# YEAR MONTH DAY_OF_MONTH DAY_OF_WEEK CARRIER FL_NUM
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 2015 1 27 2 AA 1
# 2 2015 1 28 3 AA 1
# 3 2015 1 26 1 AA 2
# 4 2015 1 27 2 AA 2
# 5 2015 1 27 2 AA 3
# 6 2015 1 28 3 AA 3
# 7 2015 1 26 1 AA 4
# 8 2015 1 27 2 AA 4
# 9 2015 1 1 4 AA 6
# 10 2015 1 3 6 AA 6
# # ... with more rows, and 10 more variables: ORIGIN <chr>,
# # DEST <chr>, DEP_TIME <dbl>, DEP_DELAY <dbl>,
# # ARR_TIME <dbl>, ...
We should also create a new column called DEP_HOUR, which will hold the hour extracted from the DEP_TIME column.
Your turn 3
Use the appropriate function to read the parquet file for 2015 flights
flights15_tbl %>%
filter(is.na(ARR_DELAY)) %>%
count()
Data preparation
flights15_prepared_tbl <- flights15_tbl %>%
  filter(!is.na(ARR_DELAY)) %>%
  filter(CANCELLED == 0) %>%
  mutate(DEP_HOUR = floor(DEP_TIME / 100)) %>%
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK)
  )
Your turn 4
Complete the following chunk in order to:
Create a new column called DEP_HOUR that holds the hour extracted from the DEP_TIME column.

flights15_prepared_tbl <- flights15_tbl %>%
  filter(!is.na(ARR_DELAY)) %>%
  filter(CANCELLED == 0) %>%
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK)
  )
Data preparation
This time we will use ft_binarizer() to identify "delayed" flights.
Spark provides feature transformers that facilitate many common transformations of data within a Spark DataFrame, and sparklyr exposes these within the ft_* family of functions.
That means all flights that arrive at least 15 minutes late are considered to be delayed.
Let's also introduce a new variable called STATUS with two values, "DELAYED" and "ONTIME".
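The chunk itself was not captured in this extraction; a sketch of what it might look like (the 15-minute threshold and the ARR_DELAY/DELAYED columns match the slides and the later pipeline print-out, while the STATUS recoding is an assumption):

flights15_prepared_tbl <- flights15_prepared_tbl %>%
  ft_binarizer(input_col = "ARR_DELAY", output_col = "DELAYED", threshold = 15) %>%
  mutate(STATUS = ifelse(DELAYED == 1, "DELAYED", "ONTIME"))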
Spark Caching
sdf_register
Registers a Spark DataFrame (giving it a table name for the Spark SQL
context), and returns a tbl_spark.
sdf_register(flights15_prepared_tbl, "flights15_prepared_spark")
# # Source: spark<flights15_prepared_spark> [?? x 19]
# YEAR MONTH DAY_OF_MONTH DAY_OF_WEEK CARRIER FL_NUM
# <dbl> <chr> <dbl> <chr> <chr> <dbl>
# 1 2015 1 1 4 AA 1
# 2 2015 1 2 5 AA 1
# 3 2015 1 3 6 AA 1
# 4 2015 1 4 7 AA 1
# 5 2015 1 5 1 AA 1
# 6 2015 1 6 2 AA 1
# 7 2015 1 7 3 AA 1
# 8 2015 1 8 4 AA 1
# 9 2015 1 9 5 AA 1
Spark Caching
tbl_cache
tbl_cache forces a Spark table with the given name to be loaded into memory.
The tbl_cache command loads the results into a Spark RDD in memory, so any analysis from there on will not need to re-read and re-transform the original data.
The cached data is also smaller than the original file, because the transformations created a smaller data set than the original file.
#tbl_uncache(sc, "flights_data")
tbl_cache(sc, "flights15_prepared_spark")
Spark Caching
tbl_cache
src_tbls(sc)
# [1] "flights15_prepared_spark" "flights15_spark"
# [3] "nycflights13_spark"
Your turn 6
Using the sdf_register and tbl_cache commands, load the results from flights15_prepared_tbl into a Spark RDD in memory, so any analysis from there on will not need to re-read and re-transform the original data.
tbl_cache: forces a Spark table with the given name to be loaded into memory.

sdf_register(flights15_prepared_tbl, "flights15_prepared_spark")
visualizing the data
Before creating a model, let's start by visualizing the data.
visualizing the data
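The chunk that builds flights_delay_perc is missing from this extraction; a minimal sketch, assuming it summarises the share of flights per STATUS from the cached table registered above:

flights15_prepared_cached <- tbl(sc, "flights15_prepared_spark")  # name used later in the deck
flights_delay_perc <- flights15_prepared_cached %>%
  count(STATUS) %>%
  mutate(Percentage = round(100 * n / sum(n), 1)) %>%
  collect()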
flights_delay_perc %>%
  ggplot(aes(x = STATUS, y = Percentage)) +
  geom_col(fill = "#2150B5") +
  geom_text(aes(y = Percentage / 2, label = paste0(Percentage, "%")), colour = "white") +  # colour value truncated; "white" assumed
  coord_flip()
visualizing the data
Let us explore the effect DAY_OF_WEEK has on delay.

library(ggrepel)
flights_delay_day %>%
  ggplot(aes(x = DAY_OF_WEEK, y = Percentage, fill = STATUS)) +
  geom_col(position = "stack") +
  geom_path(aes(color = STATUS)) +
  geom_text_repel(aes(label = Percentage), size = 3) +
  ggtitle("Percentage of Flights Delayed") +
  labs(title = "Percentage of Flights Delayed", x = "Day of Week", y = "Percentage")
Your turn 7
To explore the effect of DAY_OF_WEEK on delay, compute flights_delay_day:

group_by(DAY_OF_WEEK) %>%
collect()
flights_delay_day
visualizing the data
Now we will look at the effect of the destination (DEST) on delay.
flights_delay_dest
# # Source: spark<?> [?? x 4]
# # Groups: DEST
# DEST STATUS n Percentage
# <chr> <chr> <dbl> <dbl>
# 1 ABE ONTIME 249 86.5
# 2 ABE DELAYED 39 13.5
# 3 ABQ ONTIME 2614 82.8
# 4 ABQ DELAYED 545 17.2
# 5 AGS ONTIME 532 82.0
# 6 AGS DELAYED 117 18.0
# 7 ALB DELAYED 357 18.3
# 8 ALB ONTIME 1593 81.7
# 9 ANC ONTIME 2141 75.8
# 10 ANC DELAYED 682 24.2
# # ... with more rows
visualizing the data
flights_delay_dest %>%
filter(STATUS == "DELAYED" ) %>%
arrange(desc(Percentage)) %>%
head(10) %>%
collect() %>%
mutate(DEST = fct_reorder(DEST, Percentage)) %>%
ggplot(aes(x = DEST, y= Percentage)) +
geom_col(fill = "#2150B5" ) +
geom_text(aes(y = Percentage/2, label = paste0(Percentage,"%")), colour = "white") +  # colour value truncated; "white" assumed
#facet_wrap( STATUS) +
coord_flip()
visualizing the data
Now we will look at the effect of the origin airport (ORIGIN) on delay.
flights_delay_origin
# # Source: spark<?> [?? x 4]
# # Groups: ORIGIN
# ORIGIN STATUS n Percentage
# <chr> <chr> <dbl> <dbl>
# 1 ABE ONTIME 265 91.7
# 2 ABE DELAYED 24 8.3
# 3 ABQ ONTIME 2720 86.3
# 4 ABQ DELAYED 433 13.7
# 5 AGS ONTIME 562 86.3
# 6 AGS DELAYED 89 13.7
# 7 ALB DELAYED 246 12.7
# 8 ALB ONTIME 1695 87.3
# 9 ANC ONTIME 2289 81.3
# 10 ANC DELAYED 525 18.7
# # ... with more rows
visualizing the data
flights_delay_origin %>%
filter(STATUS == "DELAYED" ) %>%
arrange(desc(Percentage)) %>%
head(10) %>%
collect() %>%
mutate(ORIGIN = fct_reorder(ORIGIN, Percentage)) %>%
ggplot(aes(x = ORIGIN, y= Percentage)) +
geom_col(fill = "#2150B5" ) +
geom_text(aes(y = Percentage/2, label = paste0(Percentage,"%")), colour = "white") +  # colour value truncated; "white" assumed
coord_flip()
Analytic workflow with sparklyr
An analytic workflow with sparklyr might be composed of the following stages:

1. Perform SQL queries through the sparklyr dplyr interface.
2. Use the sdf_* and ft_* family of functions to generate new columns, or partition your data set.
3. Choose an appropriate machine learning algorithm from the ml_* family of functions to model your data.
4. Inspect the quality of your model fit, and use it to make predictions with new data.
5. Collect the results for visualization and further analysis in R.
Data partitioning
sdf_random_split() can be used to partition the data into training and testing sets.
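A minimal sketch (the 80/20 proportions and the seed are assumptions; partitioned_flights is the name used in the later slides):

partitioned_flights <- sdf_random_split(
  flights15_prepared_cached,
  training = 0.8,
  testing  = 0.2,
  seed     = 42
)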
Fit the model to data
ml_* family of functions
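The fitting code is not in the extracted slides; a sketch consistent with the coefficients shown below, using a logistic regression on the training partition:

flights_model <- partitioned_flights$training %>%
  ml_logistic_regression(DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR)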
summary(flights_model)
# Coefficients:
# (Intercept) DEP_DELAY DISTANCE DEP_HOUR
# -2.96957935415 0.12868564313 -0.00007834523 -0.00301209314
Run predictions in Spark
Quick review of running predictions and reviewing accuracy
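The code was not captured on this slide; a sketch of scoring the testing partition with the model fitted above:

predictions <- ml_predict(flights_model, partitioned_flights$testing)
predictions %>% select(DELAYED, prediction) %>% head()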
Your turn 8
Partition the flights15_prepared_cached dataframe using sdf_random_split().
Fit a logistic regression model to the training partition.
Use ml_predict to make predictions on the testing set.
Execute the last chunk to compute the AUC.

predictions
Spark ML Pipelines
-For R users, the insights gathered during the interactive sessions with Spark can now be converted into formal, reusable ML Pipelines.
-This makes the hand-off from Data Scientists to Big Data Engineers a lot easier.
Spark ML Pipelines
-The final list of selected variables, data manipulation steps, feature transformations and modeling can be captured in an ML Pipeline,
which means that the code can be added to scheduled Spark ML jobs and run without any dependencies in R.
Spark ML Pipelines
FEATURE TRANSFORMERS
-Pipelines make heavy use of Feature Transformers.
-These functions use the Spark API directly to transform the data, and may be faster at making the data manipulations than a dplyr (SQL) transformation.
-In sparklyr, the ft_* functions are essentially wrappers around the original Spark feature transformers.
Spark ML Pipelines
FEATURE TRANSFORMERS
-We will start with dplyr transformations, which are ultimately SQL transformations, loaded into the df variable.
-In sparklyr, there is one feature transformer that is not available in Spark: ft_dplyr_transformer().
-This transformer starts by extracting the dplyr transformations from the tbl object as a SQL statement, then passes it to ft_sql_transformer().
-The goal of this function is to convert the dplyr code to a SQL Feature Transformer that can then be used in a Pipeline.
Spark ML Pipelines
FEATURE TRANSFORMERS
ft_dplyr_transformer

df <- flights15_tbl %>%
  filter(!is.na(ARR_DELAY)) %>%
  mutate(DEP_HOUR = floor(DEP_TIME / 100)) %>%  # DEP_HOUR derivation assumed, as in the earlier data preparation
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK)
  ) %>%
  select(ARR_DELAY, DEP_DELAY, MONTH, DAY_OF_WEEK, DISTANCE, DEP_HOUR, DEST, ORIGIN)  # column list truncated after DEP_HOUR; DEST and ORIGIN assumed
Spark ML Pipelines
FEATURE TRANSFORMERS
FT_DPLYR_TRANSFORMER
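The call itself was not captured on this slide; converting the dplyr code above into a SQL feature transformer would look like:

ft_dplyr_transformer(sc, df)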
Spark ML Pipelines
flights_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "ARR_DELAY", output_col = "DELAYED", threshold = 15) %>%
  ft_r_formula(DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR + ORIGIN) %>%  # formula truncated in the print-out after "ORIGIN +"
  ml_logistic_regression()
Spark ML Pipelines
flights_pipeline
# Pipeline (Estimator) with 4 stages
# <pipeline__62ac7cd1_0549_42ea_9752_59a1c3c356a6>
# Stages
# |--1 SQLTransformer (Transformer)
# | <dplyr_transformer__98bbbc43_cc8c_4fa6_801b_8618acdcc5a8>
# | (Parameters -- Column Names)
# |--2 Binarizer (Transformer)
# | <binarizer__ff84263c_96c3_49e5_b000_3e1b555b1945>
# | (Parameters -- Column Names)
# | input_col: ARR_DELAY
# | output_col: DELAYED
# |--3 RFormula (Estimator)
# | <r_formula__fd48b47b_e719_4bf7_a38d_eb3811f8f7c3>
# | (Parameters -- Column Names)
# | features_col: features
# | label_col: label
# | (Parameters)
# | force_index_label: FALSE
# | formula: DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR + ORIGIN +
# | handle_invalid: error
# | stringIndexerOrderType: frequencyDesc
# |--4 LogisticRegression (Estimator)
Your turn 9
Recreate the code to process the data using dplyr transformations and assign it to a new variable called df.

filter(!is.na(ARR_DELAY)) %>%
ml_logistic_regression()
flights_pipeline
Data partitioning
sdf_random_split() is used to partition the data into training and testing sets.
Fit the ML Pipeline
The ml_fit() function produces the Pipeline Model.
The training partition of the partitioned_flights data is used to train the model:
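The fitting call was not captured on this slide; a sketch, with the partition name assumed from the earlier split:

fitted_pipeline <- ml_fit(flights_pipeline, partitioned_flights$training)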
Fit the ML Pipeline
Notice that the print-out for the fitted pipeline now displays the model’s
coefficients.
fitted_pipeline
# PipelineModel (Transformer) with 4 stages
# <pipeline__f2c230f2_69dd_4839_a372_54ced3c35148>
# Stages
# |--1 SQLTransformer (Transformer)
# | <dplyr_transformer__6b2f03bc_ec74_4988_906d_ed5b91b10b65>
# | (Parameters -- Column Names)
# |--2 Binarizer (Transformer)
# | <binarizer__9c8f1ce4_ab4b_4609_9a42_70e2a2842b77>
# | (Parameters -- Column Names)
# | input_col: ARR_DELAY
# | output_col: DELAYED
# |--3 RFormulaModel (Transformer)
# | <r_formula__2388ea95_1517_4dbf_b273_d5f52e7b652d>
# | (Parameters -- Column Names)
# | features_col: features
# | label_col: label
# | (Transformer Info)
# | formula: chr "DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR + OR
# |--4 LogisticRegressionModel (Transformer)
Make predictions using the fitted Pipeline
The ml_predict() function can be used to run predictions with the fitted pipeline.
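A sketch of that step (testing partition name assumed as before):

predictions <- ml_predict(fitted_pipeline, partitioned_flights$testing)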
Evaluate model performance
predictions %>%
group_by(DELAYED, prediction) %>%
tally()
# # Source: spark<?> [?? x 3]
# # Groups: DELAYED
# DELAYED prediction n
# <dbl> <dbl> <dbl>
# 1 0 1 5195
# 2 0 0 284697
# 3 1 1 41794
# 4 1 0 18691
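An AUC can also be computed with sparklyr's evaluator; since the pipeline's RFormula stage produces the default label and rawPrediction columns, a minimal call is:

ml_binary_classification_evaluator(predictions, metric_name = "areaUnderROC")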
Your turn 10
Fit the flights_pipeline pipeline to the training data using the ml_fit() function.
Use the fitted model to perform predictions on the testing data using the ml_predict() function.
predictions
# # Source: spark<?> [?? x 14]
# ARR_DELAY DEP_DELAY MONTH DAY_OF_WEEK DISTANCE DEP_HOUR
# <dbl> <dbl> <chr> <chr> <dbl> <dbl>
# 1 -17 -6 1 4 2475 12
# 2 195 178 1 4 3711 15
# 3 109 108 1 4 3784 19
# 4 6 -11 1 4 2475 6
Save the fitted pipeline
The ml_save() command can be used to save the Pipeline and PipelineModel
to disk.
The resulting output is a folder with the selected name, which contains all of
the necessary Scala scripts:
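The save calls themselves are missing from this extraction; a sketch, where the folder names are assumptions (the "flights_model" name is referenced on the next slide):

ml_save(flights_pipeline, "flights_pipeline", overwrite = TRUE)
ml_save(fitted_pipeline, "flights_model", overwrite = TRUE)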
Load the fitted pipeline
The ml_load() command can be used to re-load Pipelines and
PipelineModels.
The saved ML Pipeline files can only be loaded into an open Spark session.
A simple query can be used as the table that will be used to make the new
predictions.
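A sketch of reloading the fitted model and scoring a simple query (the saved folder name and the source table are assumptions):

reloaded_model <- ml_load(sc, "flights_model")
new_flights <- tbl(sc, "flights15_spark") %>% head(100)  # any simple query can serve as the prediction table
ml_transform(reloaded_model, new_flights)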
This, of course, does not have to be done in R: at this point "flights_model" can be loaded and scored from any Spark session, for example inside a scheduled Spark job, without any dependency on R.
Disconnect from the cluster
After you are done processing data, you should terminate the connection to the cluster:

spark_disconnect(sc)