
Introduction to analytics and ML with Sparklyr
Lotfi NAJDI
2 / 63
Connecting to Spark
We can connect to both local instances of Spark and remote Spark clusters.

Use the spark_connect() function to connect to a local instance of Spark:

sc <- spark_connect(master = "local")

The returned Spark connection (sc) provides a remote dplyr data source to the Spark cluster.
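In practice the connection is preceded by loading sparklyr (and dplyr for the verbs used in the following slides); a minimal local-setup sketch (the optional spark_install() call installs a local Spark distribution the first time and is not shown on the slides):

library(sparklyr)
library(dplyr)

# spark_install()   # only needed once, if no local Spark is installed yet

sc <- spark_connect(master = "local")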

3 / 63
Using dplyr verbs
copy_to
We’ll start by using copy_to to upload a local R data frame into a remote data source (the Spark cluster):

copy_to(sc, nycflights13::flights, "nycflights13_spark")

4 / 63
Using dplyr verbs
src_tbls
We can use src_tbls to list all tbls provided by the remote source

src_tbls(sc)
# [1] "nycflights13_spark"

5 / 63
Using dplyr verbs
tbl
Now that we’ve copied the data, we can use tbl() to take a reference to it:

nycflights13_tbl <- tbl(sc,"nycflights13_spark")

nycflights13_tbl
# # Source: spark<nycflights13_spark> [?? x 19]
# year month day dep_time sched_dep_time dep_delay
# <int> <int> <int> <int> <int> <dbl>
# 1 2013 1 1 517 515 2
# 2 2013 1 1 533 529 4
# 3 2013 1 1 542 540 2
# 4 2013 1 1 544 545 -1
# 5 2013 1 1 554 600 -6
# 6 2013 1 1 554 558 -4
# 7 2013 1 1 555 600 -5
# 8 2013 1 1 557 600 -3
# 9 2013 1 1 557 600 -3
# 10 2013 1 1 558 600 -2
# # ... with more rows, and 13 more variables:
# # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, 6 / 63
Using dplyr verbs
We can create a Spark DataFrame and take a reference to it at the same time:

nycflights13_tbl <- copy_to(sc, nycflights13::flights, "nycflights13_spark")

7 / 63
Your turn 1
Use the spark_connect() function to connect to a local instance of Spark

Upload a local data frame (R) into a remote data source using copy_to

Use src_tbls to list all tbls provided by the remote source

sc <- spark_connect(master = "local")

nycflights13_tbl <- copy_to(sc, nycflights13::flights, "nycflights13_spark")

src_tbls(sc)

8 / 63
Using dplyr verbs
to analyse data
Let's try to analyse data on flight delays from nycflights13_spark:

tailnum_delay <- nycflights13_tbl %>%
  group_by(tailnum) %>%
  summarise(delay = mean(arr_delay), n = n()) %>%
  arrange(desc(delay)) %>%
  filter(n > 100)

The sequence of operations is not executed until we ask for the data: it is only when we request the results (e.g., by printing tailnum_delay) that dplyr generates the SQL and sends the query to the remote data source.

Even then it tries to do as little work as possible and only pulls down a few rows.
9 / 63
Using dplyr verbs
to analyse data

tailnum_delay
# # Source: spark<?> [?? x 3]
# # Ordered by: desc(delay)
# tailnum delay n
# <chr> <dbl> <dbl>
# 1 N14228 3.71 111
# 2 N24211 7.7 130
# 3 N793JB 4.72 283
# 4 N657JB 5.03 285
# 5 N730MQ 1.2 178
# 6 N9EAMQ 9.24 248
# 7 N705TW -7.09 293
# 8 N318NB -1.12 202
# 9 N627VA 7.53 120
# 10 N646JB 5.00 260
# # ... with more rows

10 / 63
Using dplyr verbs
show_query()

Behind the scenes, dplyr is translating your R code into SQL.

You can see the SQL it’s generating with show_query():

tailnum_delay %>% show_query()


# <SQL>
# SELECT *
# FROM (SELECT `tailnum`, AVG(`arr_delay`) AS `delay`, COUNT(*) AS `n`
# FROM `nycflights13_spark`
# GROUP BY `tailnum`) `q01`
# WHERE (`n` > 100.0)

11 / 63
Using dplyr verbs
collect()
Typically, you’ll iterate a few times before you figure out what data you need
from the database.

Once you’ve figured it out, use collect() to pull all the data down into a local
tibble:

tailnum_delay %>% collect()  # collect() retrieves data into a local tibble


# # A tibble: 1,201 x 3
# tailnum delay n
# <chr> <dbl> <dbl>
# 1 N14228 3.71 111
# 2 N24211 7.7 130
# 3 N793JB 4.72 283
# 4 N657JB 5.03 285
# 5 N730MQ 1.2 178
# 6 N9EAMQ 9.24 248
# 7 N705TW -7.09 293
# 8 N318NB -1.12 202
# 9 N627VA 7.53 120
# 10 N646JB 5.00 260
# # ... with 1,191 more rows
12 / 63
Using SQL
to analyse Spark data
If you’re familiar with SQL, you can write SQL directly in order to analyse the data.

For example, to count the number of flights by carrier and month in the Spark data frame nycflights13_spark (the dplyr version and the equivalent SQL are shown below):

nycflights13_tbl %>%
  count(carrier, month) %>%
  arrange(-n)

SELECT carrier, month, COUNT(*) AS n
FROM nycflights13_spark
GROUP BY carrier, month
ORDER BY n DESC
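Raw SQL like the statement above can also be sent to Spark from R through the DBI interface, which sparklyr supports; a minimal sketch that returns an ordinary R data frame:

library(DBI)

dbGetQuery(sc, "
  SELECT carrier, month, COUNT(*) AS n
  FROM nycflights13_spark
  GROUP BY carrier, month
  ORDER BY n DESC")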

Table: Displaying records 1 - 185

|carrier | month|    n|
|:-------|-----:|----:|
|UA      |     8| 5124|
|UA      |     7| 5066|
|UA      |    10| 5060|
|UA      |     4| 5047|
|B6      |     7| 4984|
|UA      |     6| 4975|
|UA      |     3| 4971|
|UA      |     5| 4960|
|B6      |     8| 4952|
|UA      |    12| 4931|
13 / 63
Your turn 2
Compute the mean delay and the number of flights by tailnum (plane)

Show the SQL produced by dplyr with show_query()

Use collect() to pull all the data down into a local tibble

tailnum_delay <- nycflights13_tbl %>%
  group_by(tailnum) %>%
  summarise(delay = mean(arr_delay), n = n()) %>%
  arrange(desc(delay)) %>%
  filter(n > 100)

tailnum_delay %>% show_query()

tailnum_delay

14 / 63
Reading Data
Example of parquet file
Apache Parquet (http://parquet.apache.org):

Apache Parquet is a columnar storage (http://en.wikipedia.org/wiki/Column-oriented_DBMS) format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.


15 / 63
Reading Data
Example of parquet file

16 / 63
Reading Data
Example of parquet file
spark_read_parquet reads a parquet file and provides a data source compatible with dplyr.

This function returns a reference to a Spark DataFrame which can be used as a dplyr table (tbl).

flights15_tbl <- spark_read_parquet(sc,
                                    name = "flights15_spark",
                                    path = "data/airline_data_2015/parquet/par
                                    memory = FALSE)

memory: Boolean argument indicating whether the data should be loaded eagerly into memory (that is, whether the table should be cached).

We can use src_tbls to list all tbls provided by the remote source:

src_tbls(sc)
17 / 63
Explore Data
Use a dplyr verb to count the total number of flights in this dataset:

flights15_tbl %>% count()


# # Source: spark<?> [?? x 1]
# n
# <dbl>
# 1 1782884

Or the number of flights by CARRIER in this dataset

flights15_tbl %>% count(CARRIER)


# # Source: spark<?> [?? x 2]
# CARRIER n
# <chr> <dbl>
# 1 DL 746017
# 2 UA 424022
# 3 AA 612845

18 / 63
Explore Data
We can see that there are a lot of NA values in the ARR_DELAY column.

flights15_tbl %>%
filter(is.na(ARR_DELAY)) %>%
count()
# # Source: spark<?> [?? x 1]
# n
# <dbl>
# 1 25293

We should keep only the rows where we have valid readings of ARR_DELAY.

19 / 63
Explore Data
flights15_tbl %>%
filter(CANCELLED == 1 )
# # Source: spark<?> [?? x 16]
# YEAR MONTH DAY_OF_MONTH DAY_OF_WEEK CARRIER FL_NUM
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 2015 1 27 2 AA 1
# 2 2015 1 28 3 AA 1
# 3 2015 1 26 1 AA 2
# 4 2015 1 27 2 AA 2
# 5 2015 1 27 2 AA 3
# 6 2015 1 28 3 AA 3
# 7 2015 1 26 1 AA 4
# 8 2015 1 27 2 AA 4
# 9 2015 1 1 4 AA 6
# 10 2015 1 3 6 AA 6
# # ... with more rows, and 10 more variables: ORIGIN <chr>,
# # DEST <chr>, DEP_TIME <dbl>, DEP_DELAY <dbl>,
# # ARR_TIME <dbl>, ...

In order to analyse ARR_DELAY, we will keep only the flight records where the flight was not cancelled.


20 / 63
Explore Data
flights15_tbl %>% select(DEP_TIME)
# # Source: spark<?> [?? x 1]
# DEP_TIME
# <dbl>
# 1 855
# 2 850
# 3 853
# 4 853
# 5 853
# 6 856
# 7 859
# 8 856
# 9 901
# 10 903
# # ... with more rows

We should also create a new column called DEP_HOUR containing the hour extracted from the DEP_TIME column.
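DEP_TIME stores the departure time as an hhmm integer (for example, 855 means 8:55), so the hour can be obtained with integer division by 100. A quick sketch of the transformation used in the data-preparation step later on:

flights15_tbl %>%
  mutate(DEP_HOUR = floor(DEP_TIME / 100)) %>%   # 855 -> 8, 1730 -> 17
  select(DEP_TIME, DEP_HOUR)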

21 / 63
Your turn 3
Use the appropriate function to read the parquet file for 2015 flights and provide a data source compatible with dplyr (flights15_tbl).

Display the number of flights by CARRIER in this dataset.

Compute the number of NA values in the ARR_DELAY column.

flights15_tbl <- spark_read_parquet(sc,
                                    name = "flights15_spark",
                                    path = "data/airline_data_2015/parqu
                                    memory = FALSE)

flights15_tbl %>%
filter(is.na(ARR_DELAY)) %>%
count()

flights15_tbl %>% count(CARRIER)

22 / 63
Data preparation
flights15_prepared_tbl <- flights15_tbl %>%
  filter(!is.na(ARR_DELAY)) %>%                  # keep valid arrival-delay readings
  filter(CANCELLED == 0) %>%                     # drop cancelled flights
  mutate(DEP_HOUR = floor(DEP_TIME / 100)) %>%   # extract the hour from the hhmm time
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK))

23 / 63
Your turn 4
Complete the following chunk in order to:

keep only records where we have valid readings of ARR_DELAY.

filter out records related to cancelled flights.

Create a new column called DEP_HOUR containing the hour extracted from the DEP_TIME column.

flights15_prepared_tbl <- flights15_tbl %>%
  filter(!is.na(ARR_DELAY)) %>%
  filter(CANCELLED == 0) %>%
  mutate(DEP_HOUR = floor(DEP_TIME / 100)) %>%
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK))

24 / 63
Data preparation
This time we will use ft_binarizer() to identify “delayed” flights.

Spark provides feature transformers, facilitating many common transformations of data within a Spark DataFrame, and sparklyr exposes these within the ft_* family of functions.

Let us introduce a new column called DELAYED which has value 1 if the arrival delay (ARR_DELAY) is more than 15 minutes and 0 if ARR_DELAY is 15 minutes or less.

That means all flights that arrive more than 15 minutes late are considered to be delayed.

flights15_prepared_tbl <- flights15_prepared_tbl %>%
  ft_binarizer(
    input_col = "ARR_DELAY",
    output_col = "DELAYED",
    threshold = 15
  ) %>%
  mutate(STATUS = ifelse(DELAYED == 1, "DELAYED", "ONTIME"))
25 / 63
Your turn 5
Using ft_binarizer(), try to identify “delayed” flights. The new column (output_col) is called DELAYED and has value 1 if the arrival delay (ARR_DELAY) is more than 15 minutes and 0 if ARR_DELAY is 15 minutes or less.

Let's introduce a new variable called STATUS with two values, "DELAYED" and "ONTIME".

flights15_prepared_tbl <- flights15_prepared_tbl %>%
  ft_binarizer(
    input_col = "ARR_DELAY",
    output_col = "DELAYED",
    threshold = 15
  )

flights15_prepared_tbl <- flights15_prepared_tbl %>%
  mutate(STATUS = ifelse(DELAYED == 1, "DELAYED", "ONTIME"))

26 / 63
Spark Caching
sdf_register
Registers a Spark DataFrame (giving it a table name for the Spark SQL context) and returns a tbl_spark.

sdf_register will register the resulting Spark SQL query in Spark.

The results will show up as a table called flights15_prepared_spark, but a table of the same name is still not loaded into memory in Spark.

sdf_register(flights15_prepared_tbl, "flights15_prepared_spark")
# # Source: spark<flights15_prepared_spark> [?? x 19]
# YEAR MONTH DAY_OF_MONTH DAY_OF_WEEK CARRIER FL_NUM
# <dbl> <chr> <dbl> <chr> <chr> <dbl>
# 1 2015 1 1 4 AA 1
# 2 2015 1 2 5 AA 1
# 3 2015 1 3 6 AA 1
# 4 2015 1 4 7 AA 1
# 5 2015 1 5 1 AA 1
# 6 2015 1 6 2 AA 1
# 7 2015 1 7 3 AA 1
# 8 2015 1 8 4 AA 1
27 / 63
# 9 2015 1 9 5 AA 1
Spark Caching
tbl_cache
Forces a Spark table with a given name to be loaded into memory.

Operations on cached tables should normally (although not always) be more performant than the same operation performed on an uncached table.

The tbl_cache command loads the results into a Spark RDD in memory, so any analysis from there on will not need to re-read and re-transform the original file.

The resulting Spark RDD is smaller than the original file because the transformations created a smaller data set than the original file.

# tbl_uncache(sc, "flights_data")
tbl_cache(sc, "flights15_prepared_spark")
28 / 63
Spark Caching
tbl_cache

src_tbls(sc)
# [1] "flights15_prepared_spark" "flights15_spark"
# [3] "nycflights13_spark"

flights15_prepared_cached <- tbl(sc,"flights15_prepared_spark")

29 / 63
Your turn 6
Using the sdf_register and tbl_cache commands, load the results from flights15_prepared_tbl into a Spark RDD in memory, so any analysis from there on will not need to re-read and re-transform the original file.

sdf_register: registers a Spark DataFrame (giving it a table name for the Spark SQL context) and returns a tbl_spark.

tbl_cache: forces a Spark table with a given name to be loaded into memory.

sdf_register(flights15_prepared_tbl, "flights15_prepared_spark")

tbl_cache(sc, "flights15_prepared_spark")

flights15_prepared_cached <- tbl(sc, "flights15_prepared_spark")

30 / 63
visualizing the data
Before creating a model, let's start by visualizing the data.

flights_delay_perc <- flights15_prepared_cached %>%
  count(STATUS) %>%
  mutate(Percentage = n / sum(n)) %>%
  mutate(Percentage = round(Percentage * 100, 2)) %>%
  collect()
flights_delay_perc
# # A tibble: 2 x 3
# STATUS n Percentage
# <chr> <dbl> <dbl>
# 1 ONTIME 1455442 82.8
# 2 DELAYED 302149 17.2

31 / 63
visualizing the data
flights_delay_perc %>%
ggplot(aes(x = STATUS, y= Percentage)) +
geom_col(fill = "#2150B5" ) +
geom_text(aes(y = Percentage/2, label = paste0(Percentage,"%")), colou
coord_flip()

32 / 63
visualizing the data
Let us explore the effect that DAY_OF_WEEK has on the delay.

flights_delay_day <- flights15_prepared_cached %>%
  count(STATUS, DAY_OF_WEEK) %>%
  group_by(DAY_OF_WEEK) %>%
  mutate(Percentage = n / sum(n)) %>%
  mutate(Percentage = round(Percentage * 100, 2)) %>%
  collect()
flights_delay_day
# # A tibble: 14 x 4
# STATUS DAY_OF_WEEK n Percentage
# <chr> <chr> <dbl> <dbl>
# 1 ONTIME 1 214466 81.2
# 2 DELAYED 1 49732 18.8
# 3 DELAYED 2 44510 16.9
# 4 ONTIME 2 218690 83.1
# 5 DELAYED 3 44142 16.8
# 6 ONTIME 3 218404 83.2
# 7 ONTIME 4 211771 81.1
# 8 DELAYED 4 49442 18.9
# 9 ONTIME 5 212147 82.8
# 10 DELAYED 5 44103 17.2
# # ... with 4 more rows
33 / 63
visualizing the data
Let us explore the effect that DAY_OF_WEEK has on the delay.

library(ggrepel)
flights_delay_day %>% ggplot( aes(x=DAY_OF_WEEK,y=Percentage, fill = STA
geom_col(position = "stack")+
geom_path(aes( color = STATUS ) ) +
geom_text_repel(aes(label=Percentage), size = 3) +
ggtitle("Percentage of Flights Delayed") +
labs(title = "Percentage of Flights Delayed" , x="Day of Week",y="Per

34 / 63
Your turn 7
To explore the effect of DAY_OF_WEEK on delay, compute the percentage of delayed flights.

Produce a graph of your choice to visualize the results.

flights_delay_day <- flights15_prepared_cached %>%
  count(STATUS, DAY_OF_WEEK) %>%
  group_by(DAY_OF_WEEK) %>%
  mutate(Percentage = n / sum(n)) %>%
  mutate(Percentage = round(Percentage * 100, 2)) %>%
  collect()

flights_delay_day

35 / 63
visualizing the data
Now we will look at the effect of the destination (DEST) on the delay.

flights_delay_dest <- flights15_prepared_cached %>%
  count(DEST, STATUS) %>%
  group_by(DEST) %>%
  mutate(Percentage = n / sum(n)) %>%
  mutate(Percentage = round(Percentage * 100, 2))

flights_delay_dest
# # Source: spark<?> [?? x 4]
# # Groups: DEST
# DEST STATUS n Percentage
# <chr> <chr> <dbl> <dbl>
# 1 ABE ONTIME 249 86.5
# 2 ABE DELAYED 39 13.5
# 3 ABQ ONTIME 2614 82.8
# 4 ABQ DELAYED 545 17.2
# 5 AGS ONTIME 532 82.0
# 6 AGS DELAYED 117 18.0
# 7 ALB DELAYED 357 18.3
# 8 ALB ONTIME 1593 81.7
# 9 ANC ONTIME 2141 75.8
# 10 ANC DELAYED 682 24.2
# # ... with more rows 36 / 63
visualizing the data
flights_delay_dest %>%
filter(STATUS == "DELAYED" ) %>%
arrange(desc(Percentage)) %>%
head(10) %>%
collect() %>%
mutate(DEST = fct_reorder(DEST, Percentage)) %>%
ggplot(aes(x = DEST, y= Percentage)) +
geom_col(fill = "#2150B5" ) +
geom_text(aes(y = Percentage/2, label = paste0(Percentage,"%")), colou
#facet_wrap( STATUS) +
coord_flip()

37 / 63
visualizing the data
Now we will look at the effect of the origin airport (ORIGIN) on the delay.

flights_delay_origin <- flights15_prepared_cached %>%
  count(ORIGIN, STATUS) %>%
  group_by(ORIGIN) %>%
  mutate(Percentage = n / sum(n)) %>%
  mutate(Percentage = round(Percentage * 100, 2))

flights_delay_origin
# # Source: spark<?> [?? x 4]
# # Groups: ORIGIN
# ORIGIN STATUS n Percentage
# <chr> <chr> <dbl> <dbl>
# 1 ABE ONTIME 265 91.7
# 2 ABE DELAYED 24 8.3
# 3 ABQ ONTIME 2720 86.3
# 4 ABQ DELAYED 433 13.7
# 5 AGS ONTIME 562 86.3
# 6 AGS DELAYED 89 13.7
# 7 ALB DELAYED 246 12.7
# 8 ALB ONTIME 1695 87.3
# 9 ANC ONTIME 2289 81.3
# 10 ANC DELAYED 525 18.7
# # ... with more rows 38 / 63
visualizing the data
flights_delay_origin %>%
filter(STATUS == "DELAYED" ) %>%
arrange(desc(Percentage)) %>%
head(10) %>%
collect() %>%
mutate(ORIGIN = fct_reorder(ORIGIN, Percentage)) %>%
ggplot(aes(x = ORIGIN, y= Percentage)) +
geom_col(fill = "#2150B5" ) +
geom_text(aes(y = Percentage/2, label = paste0(Percentage,"%")), colou
coord_flip()

39 / 63
Analytic workflow with sparklyr
An analytic workflow with sparklyr might be composed of the following stages.

For an example see Example Workflow.

1. Perform SQL queries through the sparklyr dplyr interface,

2. Use the sdf_* and ft_* family of functions to generate new columns, or partition your data set,

3. Choose an appropriate machine learning algorithm from the ml_* family of functions to model your data,

4. Inspect the quality of your model fit, and use it to make predictions with new data,

5. Collect the results for visualization and further analysis in R.

40 / 63
Data partitioning
sdf_random_split() can be used to partition the data:

partitioned_flights <- flights15_prepared_cached %>%
  sdf_random_split(training = 0.8, testing = 0.2)
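For reproducible splits, sdf_random_split() also accepts a seed argument; a small sketch (the seed value here is an arbitrary illustration):

partitioned_flights <- flights15_prepared_cached %>%
  sdf_random_split(training = 0.8, testing = 0.2, seed = 1234)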

41 / 63
Fit the model to data
ml_* family functions

flights_model <- partitioned_flights$training %>%
  ml_logistic_regression(DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR)

summary(flights_model)
# Coefficients:
# (Intercept) DEP_DELAY DISTANCE DEP_HOUR
# -2.96957935415 0.12868564313 -0.00007834523 -0.00301209314

42 / 63
Run predictions in Spark
Quick review of running predictions and reviewing accuracy

predictions <- ml_predict( flights_model, partitioned_flights$testing )

predictions %>% select(label ,prediction,probability_0,probability_1)


# # Source: spark<?> [?? x 4]
# label prediction probability_0 probability_1
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1.95e-15 1.00
# 2 1 1 3.07e- 9 1.00
# 3 0 0 9.90e- 1 0.00998
# 4 0 0 9.77e- 1 0.0230
# 5 0 0 8.86e- 1 0.114
# 6 0 0 9.82e- 1 0.0182
# 7 0 0 9.43e- 1 0.0569
# 8 0 0 9.79e- 1 0.0208
# 9 0 0 9.08e- 1 0.0917
# 10 0 0 9.83e- 1 0.0169
# # ... with more rows
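The accuracy review mentioned above can be done with Spark's built-in evaluators; for example, the area under the ROC curve, as used later in the exercises:

ml_binary_classification_evaluator(predictions, metric_name = "areaUnderROC")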

43 / 63
Your turn 8
Partition the flights15_prepared_cached dataframe using sdf_random_split()
Fit a logistic regression to the training partition
Use ml_predict to make predictions on the testing set
Execute the last chunk to compute the AUC

partitioned_flights <- flights15_prepared_cached %>%
  sdf_random_split(training = 0.8, testing = 0.2)

flights_model <- partitioned_flights$training %>%
  ml_logistic_regression(DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR)

predictions <- ml_predict(flights_model, partitioned_flights$testing)

predictions

ml_binary_classification_evaluator(predictions, metric_name = "areaUnderROC")


44 / 63
Spark ML Pipelines
- Spark’s ML Pipelines provide a way to easily combine multiple transformations and algorithms into a single workflow, or pipeline.

- For R users, the insights gathered during interactive sessions with Spark can now be converted to a formal pipeline.

- This makes the hand-off from Data Scientists to Big Data Engineers a lot easier.

45 / 63
Spark ML Pipelines
- The final list of selected variables, data manipulation, feature transformations and modeling can easily be re-written into an ml_pipeline() object, saved, and ultimately placed into a Production environment.

- The sparklyr output of a saved Spark ML Pipeline object is in Scala code, which means that the code can be added to scheduled Spark ML jobs without any dependency on R.

flights_pipeline <- ml_pipeline(sc)

46 / 63
Spark ML Pipelines
FEATURE TRANSFORMERS
- Pipelines make heavy use of Feature Transformers.

- These functions use the Spark API directly to transform the data, and may be faster at performing the data manipulations than a dplyr (SQL) transformation.

- In sparklyr, the ft_* functions are essentially wrappers around the original Spark feature transformers.

47 / 63
Spark ML Pipelines
FEATURE TRANSFORMERS
- We will start with dplyr transformations, which are ultimately SQL transformations, loaded into the df variable.

- In sparklyr, there is one feature transformer that is not available in Spark: ft_dplyr_transformer().

- This transformer starts by extracting the dplyr transformations from the tbl object as a SQL statement, then pipes it to ft_sql_transformer().

- The goal of this function is to convert the dplyr code to a SQL Feature Transformer that can then be used in a Pipeline.
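As a quick check, the SQL extracted by ft_dplyr_transformer() can be inspected by building the transformer on its own and reading its statement parameter; a hedged sketch, assuming df is the dplyr-prepared tbl defined on the next slide (the parameter name follows Spark's SQLTransformer):

ft_dplyr_transformer(sc, tbl = df) %>%
  ml_param("statement")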

48 / 63
Spark ML Pipelines
FEATURE TRANSFORMERS
ft_dplyr_transformer

df <- flights15_tbl %>%
  filter(!is.na(ARR_DELAY)) %>%
  filter(CANCELLED == 0) %>%
  mutate(DEP_HOUR = floor(DEP_TIME / 100)) %>%
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK)) %>%
  select(ARR_DELAY, DEP_DELAY, MONTH, DAY_OF_WEEK, DISTANCE, DEP_HOUR, DEST, ORIGIN)

49 / 63
Spark ML Pipelines
FEATURE TRANSFORMERS
FT_DPLYR_TRANSFORMER

flights_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df)

50 / 63
Spark ML Pipelines
flights_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "ARR_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_r_formula(DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR + ORIGIN + DEST) %>%
  ml_logistic_regression()

51 / 63
Spark ML Pipelines
flights_pipeline
# Pipeline (Estimator) with 4 stages
# <pipeline__62ac7cd1_0549_42ea_9752_59a1c3c356a6>
# Stages
# |--1 SQLTransformer (Transformer)
# | <dplyr_transformer__98bbbc43_cc8c_4fa6_801b_8618acdcc5a8>
# | (Parameters -- Column Names)
# |--2 Binarizer (Transformer)
# | <binarizer__ff84263c_96c3_49e5_b000_3e1b555b1945>
# | (Parameters -- Column Names)
# | input_col: ARR_DELAY
# | output_col: DELAYED
# |--3 RFormula (Estimator)
# | <r_formula__fd48b47b_e719_4bf7_a38d_eb3811f8f7c3>
# | (Parameters -- Column Names)
# | features_col: features
# | label_col: label
# | (Parameters)
# | force_index_label: FALSE
# | formula: DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR + ORIGIN +
# | handle_invalid: error
# | stringIndexerOrderType: frequencyDesc
52 / 63
# |--4 LogisticRegression (Estimator)
Your turn 9
Recreate the code to process data using dplyr transformations and assign it to a new variable called df.

Start a new pipeline flights_pipeline with ml_pipeline() to combine data manipulation, feature transformations and the algorithm into a single workflow, or pipeline.

Pipe flights_pipeline to ft_dplyr_transformer(tbl = df).

ft_dplyr_transformer converts dplyr code to a SQL Feature Transformer that will then be used in the Pipeline.

df <- flights15_tbl %>%
  filter(!is.na(ARR_DELAY)) %>%
  filter(CANCELLED == 0) %>%
  mutate(DEP_HOUR = floor(DEP_TIME / 100)) %>%


53 / 63
Your turn 9
Complete the flights_pipeline by introducing:

ft_binarizer() to determine if ARR_DELAY is over 15 minutes.
ft_r_formula().
a logistic regression model with the ml_logistic_regression() function.

flights_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "ARR_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_r_formula(DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR + ORIGIN + DEST) %>%
  ml_logistic_regression()

flights_pipeline

54 / 63
Data partitioning
sdf_random_split() to partition the data:

partitioned_flights <- flights15_tbl %>%
  sdf_random_split(training = 0.8, testing = 0.2)

55 / 63
Fit the ML Pipeline
The ml_fit() function produces the Pipeline Model.

The training partition of the partitioned_flights data is used to train the model:

fitted_pipeline <- ml_fit(flights_pipeline,
                          partitioned_flights$training)

56 / 63
Fit the ML Pipeline
Notice that the print-out for the fitted pipeline now displays the model’s
coefficients.

fitted_pipeline
# PipelineModel (Transformer) with 4 stages
# <pipeline__f2c230f2_69dd_4839_a372_54ced3c35148>
# Stages
# |--1 SQLTransformer (Transformer)
# | <dplyr_transformer__6b2f03bc_ec74_4988_906d_ed5b91b10b65>
# | (Parameters -- Column Names)
# |--2 Binarizer (Transformer)
# | <binarizer__9c8f1ce4_ab4b_4609_9a42_70e2a2842b77>
# | (Parameters -- Column Names)
# | input_col: ARR_DELAY
# | output_col: DELAYED
# |--3 RFormulaModel (Transformer)
# | <r_formula__2388ea95_1517_4dbf_b273_d5f52e7b652d>
# | (Parameters -- Column Names)
# | features_col: features
# | label_col: label
# | (Transformer Info)
# | formula: chr "DELAYED ~ DEP_DELAY + DISTANCE + DEP_HOUR + OR
# |--4 LogisticRegressionModel (Transformer) 57 / 63
Make predictions using the fitted Pipeline
The ml_predict() function can be used to run predictions.

predictions <- ml_predict(fitted_pipeline, partitioned_flights$testing )

58 / 63
Evaluate model performance
predictions %>%
group_by(DELAYED, prediction) %>%
tally()
# # Source: spark<?> [?? x 3]
# # Groups: DELAYED
# DELAYED prediction n
# <dbl> <dbl> <dbl>
# 1 0 1 5195
# 2 0 0 284697
# 3 1 1 41794
# 4 1 0 18691

ml_binary_classification_evaluator(predictions, metric_name = "areaUnderROC")


# [1] 0.9258695
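From the same predictions table, an overall accuracy figure can also be computed directly in Spark; a minimal sketch using the columns shown in the tally above:

predictions %>%
  mutate(correct = ifelse(DELAYED == prediction, 1, 0)) %>%
  summarise(accuracy = mean(correct))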

59 / 63
Your turn 10
Fit the flights_pipeline pipeline using the training data and the ml_fit() function.

Use the fitted model to perform predictions on the testing data with the ml_predict() function.

Check model performance by computing the AUC.

partitioned_flights <- flights15_tbl %>%
  sdf_random_split(training = 0.8, testing = 0.2)

fitted_pipeline <- ml_fit(flights_pipeline,
                          partitioned_flights$training)

predictions <- ml_predict(fitted_pipeline, partitioned_flights$testing)

predictions
# # Source: spark<?> [?? x 14]
# ARR_DELAY DEP_DELAY MONTH DAY_OF_WEEK DISTANCE DEP_HOUR
# <dbl> <dbl> <chr> <chr> <dbl> <dbl>
# 1 -17 -6 1 4 2475 12
# 2 195 178 1 4 3711 15
# 3 109 108 1 4 3784 19
# 4 6 -11 1 4 2475 6 60 / 63
Save the fitted pipeline
The ml_save() command can be used to save the Pipeline and PipelineModel to disk.

The resulting output is a folder with the selected name, which contains all of the necessary Scala scripts:

ml_save(flights_pipeline, "saved_pipeline", overwrite = TRUE )

61 / 63
Load the fitted pipeline
The ml_load() command can be used to re-load Pipelines and PipelineModels.

The saved ML Pipeline files can only be loaded into an open Spark session.

A simple query can be used as the table that will be used to make the new predictions.

This, of course, does not have to be done in R; at this point the “flights_model” can be loaded into an independent Spark session outside of R.
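A minimal sketch of reloading and reusing the saved pipeline in R, assuming the "saved_pipeline" folder from the previous slide; new_flights_tbl is an illustrative tbl_spark holding the rows to score:

reloaded_pipeline <- ml_load(sc, "saved_pipeline")

# Re-fit the reloaded pipeline, then score new data
refitted <- ml_fit(reloaded_pipeline, partitioned_flights$training)
new_predictions <- ml_transform(refitted, new_flights_tbl)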

62 / 63
Disconnect from the cluster
After you are done processing data, you should terminate the connection to the cluster, which also terminates the cluster tasks.

spark_disconnect(sc)

63 / 63
