Mastering Apache Spark
Apache Spark
Table of Contents
Introduction 0
Overview of Spark 1
Anatomy of Spark Application 2
SparkConf - Configuration for Spark Applications 2.1
SparkContext - the door to Spark 2.2
RDD - Resilient Distributed Dataset 2.3
Operators - Transformations and Actions 2.3.1
mapPartitions 2.3.1.1
Partitions and Partitioning 2.3.2
Caching and Persistence 2.3.3
Shuffling 2.3.4
Checkpointing 2.3.5
Dependencies 2.3.6
Types of RDDs 2.3.7
ParallelCollectionRDD 2.3.7.1
MapPartitionsRDD 2.3.7.2
PairRDDFunctions 2.3.7.3
CoGroupedRDD 2.3.7.4
HadoopRDD 2.3.7.5
ShuffledRDD 2.3.7.6
BlockRDD 2.3.7.7
Spark Tools 3
Spark Shell 3.1
WebUI - UI for Spark Monitoring 3.2
Executors Tab 3.2.1
spark-submit 3.3
spark-class 3.4
Spark Architecture 4
Driver 4.1
Master 4.2
Workers 4.3
Executors 4.4
Spark Runtime Environment 5
DAGScheduler 5.1
Jobs 5.1.1
Stages 5.1.2
Task Scheduler 5.2
Tasks 5.2.1
TaskSets 5.2.2
TaskSetManager 5.2.3
TaskSchedulerImpl - Default TaskScheduler 5.2.4
Scheduler Backend 5.3
CoarseGrainedSchedulerBackend 5.3.1
Executor Backend 5.4
CoarseGrainedExecutorBackend 5.4.1
Shuffle Manager 5.5
Block Manager 5.6
HTTP File Server 5.7
Broadcast Manager 5.8
Dynamic Allocation 5.9
Data Locality 5.10
Cache Manager 5.11
Spark, Akka and Netty 5.12
OutputCommitCoordinator 5.13
RPC Environment (RpcEnv) 5.14
Netty-based RpcEnv 5.14.1
ContextCleaner 5.15
MapOutputTracker 5.16
ExecutorAllocationManager 5.17
Deployment Environments 6
Spark local 6.1
Spark on cluster 6.2
Spark Standalone 6.2.1
Master 6.2.1.1
web UI 6.2.1.2
Management Scripts for Standalone Master 6.2.1.3
Management Scripts for Standalone Workers 6.2.1.4
Checking Status 6.2.1.5
Example 2-workers-on-1-node Standalone Cluster (one executor per worker) 6.2.1.6
Spark on Mesos 6.2.2
Spark on YARN 6.2.3
Execution Model 7
Advanced Concepts of Spark 8
Broadcast variables 8.1
Accumulators 8.2
Security 9
Spark Security 9.1
Securing Web UI 9.2
Data Sources in Spark 10
Using Input and Output (I/O) 10.1
Spark and Parquet 10.1.1
Serialization 10.1.2
Using Apache Cassandra 10.2
Using Apache Kafka 10.3
Spark Application Frameworks 11
Spark Streaming 11.1
StreamingContext 11.1.1
Stream Operators 11.1.2
Windowed Operators 11.1.2.1
SaveAs Operators 11.1.2.2
Stateful Operators 11.1.2.3
web UI and Streaming Statistics Page 11.1.3
Streaming Listeners 11.1.4
Checkpointing 11.1.5
JobScheduler 11.1.6
JobGenerator 11.1.7
DStreamGraph 11.1.8
Books 16.2
Commercial Products using Apache Spark 17
IBM Analytics for Apache Spark 17.1
Google Cloud Dataproc 17.2
Spark Advanced Workshop 18
Requirements 18.1
Day 1 18.2
Day 2 18.3
Spark Talks Ideas (STI) 19
10 Lesser-Known Tidbits about Spark Standalone 19.1
Learning Spark internals using groupBy (to cause shuffle) 19.2
Glossary
I’m Jacek Laskowski, an independent consultant who offers development and training
services for Apache Spark (and Scala, sbt with a bit of Apache Kafka, Apache Hive,
Apache Mesos, Akka Actors/Stream/HTTP, and Docker). I run Warsaw Scala Enthusiasts
and Warsaw Spark meetups.
If you like the notes you may consider participating in my own, very hands-on Spark and
Scala Workshop.
This collection of notes (what some may rashly call a "book") serves as the ultimate place for me to collect all the nuts and bolts of using Apache Spark. The notes aim to help me design and develop better products with Spark. They are also a viable proof of my understanding of Apache Spark. I do eventually want to reach the highest level of mastery in Apache Spark.
It may become a book one day, but it surely serves as the study material for trainings, workshops, videos and courses about Apache Spark. Follow me on Twitter @jaceklaskowski to be among the first to know. You will also learn about upcoming events about Apache Spark.
Expect text and code snippets from Spark’s mailing lists, the official documentation of
Apache Spark, StackOverflow, blog posts, books from O’Reilly, press releases,
YouTube/Vimeo videos, Quora, the source code of Apache Spark, etc. Attribution follows.
Overview of Spark
When you hear Apache Spark it can refer to two things: the Spark engine, aka Spark Core, or the Spark project - an "umbrella" term for Spark Core and the accompanying Spark Application Frameworks, i.e. Spark SQL, Spark Streaming, Spark MLlib and Spark GraphX, that sit on top of Spark Core and the main data abstraction in Spark called RDD - Resilient Distributed Dataset.
Why Spark
Let's list a few of the many reasons for Spark. We do that first, and then comes the overview that lends a more technical helping hand.
Diverse Workloads
As said by Matei Zaharia - the author of Apache Spark - in the Introduction to AmpLab Spark Internals video (quoting with a few changes):
One of the Spark project goals was to deliver a platform that supports a very wide array of diverse workflows - not only MapReduce batch jobs (which were already available in Hadoop at that time), but also iterative computations like graph algorithms or Machine Learning.
And also different scales of workloads, from sub-second interactive jobs to jobs that run for many hours.
Spark also supports near real-time streaming workloads via the Spark Streaming application framework.
ETL workloads and analytics workloads are different, yet Spark attempts to offer a unified platform for a wide variety of workloads.
Graph and Machine Learning algorithms are iterative by nature, and fewer writes to disk or transfers over the network mean better performance.
You should watch the video What is Apache Spark? by Mike Olson, Chief Strategy Officer and Co-Founder at Cloudera, who provides an excellent overview of Apache Spark, its rise in popularity in the open source community, and how Spark is primed to replace MapReduce as the general processing engine in Hadoop.
Spark draws many ideas out of Hadoop MapReduce. They work together well - Spark on
YARN and HDFS - while improving on the performance and simplicity of the distributed
computing engine.
And it should not come as a surprise that, without Hadoop MapReduce (its advances and deficiencies), Spark would not have been born at all.
It is also exposed in Java, Python and R (as well as SQL, i.e. SparkSQL, in a sense).
So, when you need a distributed Collections API in Scala, Spark with the RDD API should be a serious contender.
It expanded the available computation styles beyond the map-and-reduce model available in Hadoop MapReduce.
It is also very productive that teams can exploit the different skills their members have acquired so far. Data analysts, data scientists, and Python, Java, Scala or R programmers can all use the same Spark platform through a tailor-made API. It brings skilled people with expertise in different programming languages together on a Spark project.
Interactive exploration
It is also called ad hoc queries.
Using the Spark shell you can execute computations to process large amounts of data (The Big Data). It's all interactive and very useful to explore the data before the final production release.
Also, using the Spark shell you can access any Spark cluster as if it were your local machine. Just point the Spark shell to a 20-node cluster with 10TB of RAM in total (using --master ) and use all the components (and their abstractions) like Spark SQL, Spark MLlib, Spark Streaming, and Spark GraphX.
Depending on your needs and skills, you may see a better fit for SQL vs programming APIs, or apply machine learning algorithms (Spark MLlib) to data in graph data structures (Spark GraphX).
Single environment
Regardless of which programming language you are good at, be it Scala, Java, Python or R,
you can use the same single clustered runtime environment for prototyping, ad hoc queries,
and deploying your applications leveraging the many ingestion data points offered by the
Spark platform.
You can be as low-level as using RDD API directly or leverage higher-level APIs of Spark
SQL (DataFrames), Spark MLlib (Pipelines), Spark GraphX (???), or Spark Streaming
(DStreams).
The single programming model and execution engine for different kinds of workloads
simplify development and deployment architectures.
Both input and output data sources allow programmers and data engineers to use Spark as the platform where large amounts of data are read from or saved to for processing, interactively (using the Spark shell) or in applications.
Spark embraces many concepts in a single unified development and runtime environment.
Machine learning, which is so tool- and feature-rich in Python (e.g. the scikit-learn library), can now be used by Scala developers (via the Pipeline API in Spark MLlib or by calling pipe() ).
This single platform gives plenty of opportunities for Python, Scala, Java, and R programmers as well as data engineers (SparkR) and scientists (using proprietary enterprise data warehouses with the Thrift JDBC/ODBC server in Spark SQL).
Mind the proverb if all you have is a hammer, everything looks like a nail, too.
Low-level Optimizations
Apache Spark uses a directed acyclic graph (DAG) of computation stages (aka execution DAG). It postpones any processing until an action really requires it. Spark's lazy evaluation gives plenty of opportunities to induce low-level optimizations (so users have to know less to do more).
Many Machine Learning algorithms, like logistic regression, require plenty of iterations before the resulting models become optimal. The same applies to graph algorithms that traverse all the nodes and edges when needed. Such computations can increase their performance when the interim partial results are stored in memory or on very fast solid-state drives.
Spark can cache intermediate data in memory for faster model building and training. Once
the data is loaded to memory (as an initial step), reusing it multiple times incurs no
performance slowdowns.
Also, graph algorithms can traverse graphs one connection per iteration with the partial
result in memory.
Less disk access and network traffic can make a huge difference when you need to process lots of data, especially when it is Big Data.
Scala in Spark, especially, makes for much less boilerplate code (compared to other languages and approaches like MapReduce in Java).
One of the many motivations to build Spark was to have a framework that is good at data
reuse.
Spark is designed to keep as much data as possible in memory and keep it there until a job is finished. It doesn't matter how many stages belong to a job. What does matter is the available memory and how effective you are in using the Spark API (so that no shuffles occur).
The less network and disk IO, the better the performance, and Spark tries hard to find ways to minimize both.
The reasonably small codebase of Spark invites project contributors - programmers who extend the platform and fix bugs at a steady pace.
Overview
Apache Spark is an open-source, parallel, distributed, general-purpose cluster computing framework with an in-memory big data processing engine and programming interfaces (APIs) for the programming languages Scala, Python, Java, and R.
Or, as a one-liner: Apache Spark is a distributed data processing engine for batch and streaming modes featuring SQL queries, graph processing, and Machine Learning.
Using Spark Application Frameworks, Spark simplifies access to machine learning and
predictive analytics at scale.
Spark is mainly written in Scala, but supports other languages, i.e. Java, Python, and R.
If you have large amounts of data that require low-latency processing that a typical MapReduce program cannot provide, Spark is an alternative.
The Apache Spark project is an umbrella for SQL (with DataFrames), streaming, machine
learning (pipelines) and graph processing engines built atop Spark Core. You can run them
all in a single application using a consistent API.
Spark runs locally as well as in clusters, on-premises or in the cloud. It runs on top of Hadoop YARN, Apache Mesos, standalone, or in the cloud (Amazon EC2 or IBM Bluemix).
Apache Spark’s Streaming and SQL programming models with MLlib and GraphX make it
easier for developers and data scientists to build applications that exploit machine learning
and graph analytics.
At a high level, any Spark application creates RDDs out of some input, runs (lazy) transformations of these RDDs into some other form (shape), and finally performs actions to collect or store data. Not much, huh?
You can look at Spark from a programmer's, a data engineer's and an administrator's point of view. And to be honest, all three types of people will spend quite a lot of their time with Spark before they finally reach the point where they exploit all the available features. Programmers use language-specific APIs (and work at the level of RDDs using transformations and actions), data engineers use higher-level abstractions like the DataFrames or Pipelines APIs or external tools (that connect to Spark), and it is all only possible to run because administrators set up Spark clusters to deploy Spark applications to.
Spark is like emacs - once you join emacs, you can’t leave emacs.
For it to work, you have to create a Spark configuration using SparkConf or use a custom
SparkContext constructor.
package pl.japila.spark

import org.apache.spark.{SparkConf, SparkContext}

object SparkMeApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkMe App").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // ...create RDDs, run transformations and actions here...
    sc.stop()
  }
}
Tip Spark shell creates a Spark context and SQL context for you at startup.
You can then create RDDs, transform them to other RDDs and ultimately execute actions.
You can also cache interim RDDs to speed up data processing.
After all the data processing is completed, the Spark application finishes by stopping the
Spark context.
Caution TODO Describe SparkConf object for the application configuration:
the default configs
system properties
…
Spark Properties
Every user program starts with creating an instance of SparkConf that holds the master
URL to connect to ( spark.master ), the name for your Spark application (that is later
displayed in web UI and becomes spark.app.name ) and other Spark properties required for
proper runs. An instance of SparkConf is then used to create SparkContext.
Start Spark shell with --conf spark.logConf=true to log the effective Spark
configuration as INFO when SparkContext is started.
You can query for the values of Spark properties in Spark shell as follows:
scala> sc.getConf.getOption("spark.local.dir")
res0: Option[String] = None
scala> sc.getConf.getOption("spark.app.name")
res1: Option[String] = Some(Spark shell)
scala> sc.getConf.get("spark.master")
res2: String = local[*]
Setting up Properties
There are the following ways to set up properties for Spark and user programs (in the order
of importance from the least important to the most important):
SparkConf
Default Configuration
The default Spark configuration is created when you execute the following code:
import org.apache.spark.SparkConf
val conf = new SparkConf
You can use conf.toDebugString or conf.getAll to print out the spark.* system properties that were loaded.
scala> conf.getAll
res0: Array[(String, String)] = Array((spark.app.name,Spark shell), (spark.jars,""), (spark.master,lo
scala> conf.toDebugString
res1: String =
spark.app.name=Spark shell
spark.jars=
spark.master=local[*]
spark.submit.deployMode=client
scala> println(conf.toDebugString)
spark.app.name=Spark shell
spark.jars=
spark.master=local[*]
spark.submit.deployMode=client
You have to create a Spark context before using Spark features and services in your
application. A Spark context can be used to create RDDs, accumulators and broadcast
variables, access Spark services and run jobs.
A Spark context is essentially a client of Spark’s execution environment and acts as the
master of your Spark application (don’t get confused with the other meaning of Master in
Spark, though).
Creating RDDs
Creating accumulators
Accessing services, e.g. Task Scheduler, Listener Bus, Block Manager, Scheduler
Backends, Shuffle Manager.
Running jobs
Closure Cleaning
Master URL
Caution FIXME
Connecting to a cluster
Application Name
Caution FIXME
SparkContext.makeRDD
Caution FIXME
DAGScheduler.submitJob method).
It is used in:
AsyncRDDActions methods
Spark Configuration
Caution FIXME
Creating SparkContext
You create a SparkContext instance using a SparkConf object.
1. You can also use the other constructor of SparkContext , i.e. new
SparkContext(master="local[*]", appName="SparkMe App", new SparkConf) , with the master URL and application name specified explicitly.
When a Spark context starts up you should see the following INFO in the logs (amongst the
other messages that come from services):
Only one SparkContext may be running in a single JVM (check out SPARK-2243 Support
multiple SparkContexts in the same JVM). Sharing access to a SparkContext in the JVM is
the solution to share data within Spark (without relying on other means of data sharing using
external data stores).
spark.driver.allowMultipleContexts
Quoting the scaladoc of org.apache.spark.SparkContext:
Only one SparkContext may be active per JVM. You must stop() the active
SparkContext before creating a new one.
With spark.driver.allowMultipleContexts enabled, Spark only warns (rather than fails) when multiple SparkContexts are active, i.e. multiple SparkContexts are running in this JVM. When creating an instance of SparkContext , Spark marks the current thread as the one creating it (very early in the instantiation process).
Caution It's not guaranteed that Spark will work properly with two or more SparkContexts. Consider the feature a work in progress.
When an RDD is created, it belongs to and is completely owned by the Spark context it
originated from. RDDs can’t by design be shared between SparkContexts.
Creating RDD
SparkContext allows you to create many different RDDs from input sources like:
setCheckpointDir(directory: String)
Caution FIXME
Creating accumulators
Caution FIXME
SparkContext comes with the broadcast method to broadcast a value among Spark executors.
scala> sc.broadcast("hello")
INFO MemoryStore: Ensuring 1048576 bytes of free space for block broadcast_0(free: 535953408, max: 53
INFO MemoryStore: Ensuring 80 bytes of free space for block broadcast_0(free: 535953408, max: 5359534
INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 80.0 B, free 80.0 B)
INFO MemoryStore: Ensuring 34 bytes of free space for block broadcast_0_piece0(free: 535953328, max:
INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 34.0 B, free 114
INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:61505 (size: 34.0 B, free: 511
INFO SparkContext: Created broadcast 0 from broadcast at <console>:25
res0: org.apache.spark.broadcast.Broadcast[String] = Broadcast(0)
Spark transfers the value to Spark executors once, and tasks can share it without incurring
repetitive network transmissions when requested multiple times.
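For illustration, a minimal sketch (the lookup table and the dataset below are made up for the example):

// Ship a small lookup table to executors once instead of with every task.
val countryByCode = Map("PL" -> "Poland", "FR" -> "France")
val codes = sc.broadcast(countryByCode)

val users = sc.parallelize(Seq(("jacek", "PL"), ("marie", "FR")))
users.map { case (name, code) => (name, codes.value.getOrElse(code, "unknown")) }.collect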
You should not broadcast an RDD to use in tasks, and Spark will warn you if you do. It will not stop you, though. Consult SPARK-5063 Display more helpful error messages for several invalid operations.
scala> sc.broadcast(rdd)
WARN SparkContext: Can not directly broadcast RDDs; instead, call collect() and broadcast the result
scala> sc.addJar("build.sbt")
15/11/11 21:54:54 INFO SparkContext: Added JAR build.sbt at http://192.168.1.4:49427/jars/build.sbt w
SparkContext tracks shuffle ids using the nextShuffleId internal field for registering shuffle dependencies to Shuffle Service.
Running Jobs
All RDD actions in Spark launch jobs (that are run on one or many partitions of the RDD)
using SparkContext.runJob(rdd: RDD[T], func: Iterator[T] ⇒ U): Array[U] .
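For example, a sketch that computes the number of records in every partition separately (assuming a local README.md):

val lines = sc.textFile("README.md")

// One element in the result array per partition of lines.
val sizesPerPartition: Array[Int] = sc.runJob(lines, (it: Iterator[String]) => it.size)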
Tip For some actions like first() and lookup() , there is no need to compute all the partitions of the RDD in a job. And Spark knows it.
import org.apache.spark.TaskContext
1. Run a job using runJob on lines RDD with a function that returns 1 for every partition
(of lines RDD).
2. What can you say about the number of partitions of the lines RDD? Is your result
res0 different than mine? Why?
partition).
Caution Spark can only run jobs when a Spark context is available and active, i.e. started (see Stopping SparkContext). Since SparkContext runs inside a Spark driver, i.e. a Spark application, it must be alive to run jobs.
Stopping SparkContext
You can stop a Spark context using SparkContext.stop() method. Stopping a Spark context
stops the Spark Runtime Environment and effectively shuts down the entire Spark
application (see Anatomy of Spark Application).
Calling stop many times leads to the following INFO message in the logs:
scala> sc.stop
scala> sc.parallelize(0 to 5)
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Internally, stop() performs the following (among other things):
Stops web UI
metadataCleaner.cancel()
Stops ContextCleaner
Stops ExecutorAllocationManager
Stops DAGScheduler
Stops EventLoggingListener
Stops HeartbeatReceiver
Stops ConsoleProgressBar
Stops SparkEnv
If all went fine you should see the following INFO message in the logs:
HeartbeatReceiver
Caution FIXME
HeartbeatReceiver is a SparkListener.
Events
When a Spark context starts, it triggers SparkListenerEnvironmentUpdate and
SparkListenerApplicationStart events.
Persisted RDDs
FIXME When is the internal field persistentRdds used?
SparkStatusTracker
SparkStatusTracker requires a Spark context to work. It is created as part of SparkContext’s
initialization.
ConsoleProgressBar
ConsoleProgressBar shows the progress of active stages in the console (to stderr ). It polls the status of stages from SparkStatusTracker periodically and prints out active stages with more than one task. It keeps overwriting the same line to show at most the 3 first concurrent stages at a time.
The progress includes the stage id and the number of completed, active, and total tasks.
It is useful when you ssh to workers and want to see the progress of active stages.
import org.apache.log4j._
Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.WARN)
The progress bar prints out the status after a stage has run for at least 500ms , every 200ms (the values are not configurable).
4. Run a job with 4 tasks with 500ms initial sleep and 200ms sleep chunks to see the
progress bar.
You may want to use the following example to see the progress bar in full glory - all 3
concurrent stages in console (borrowed from a comment to [SPARK-4017] show progress
bar in console #3029):
Not only does the ClosureCleaner.clean method clean the closure, it also does so transitively, i.e. referenced closures are cleaned transitively.
With DEBUG logging level you should see the following messages in the logs:
Creating a SparkContext
Let’s walk through a typical initialization code of SparkContext in a Spark application and
see what happens under the covers.
Note The example uses Spark in local mode, i.e. setMaster("local[*]") , but the initialization with the other cluster modes would follow similar steps.
It all starts with checking whether SparkContexts can be shared or not using
spark.driver.allowMultipleContexts .
The very first information printed out is the version of Spark as an INFO message:
The current user name is computed, i.e. read from the value of the SPARK_USER environment variable or the currently logged-in user. It is available later on as sparkUser .
scala> sc.sparkUser
res0: String = jacek
The initialization then checks whether a master URL as spark.master and an application
name as spark.app.name are defined. SparkException is thrown if not.
It sets the jars and files based on spark.jars and spark.files , respectively. These are
files that are required for proper task execution on executors.
For yarn-client master URL, the system property SPARK_YARN_MODE is set to true .
MetadataCleaner is created.
true ) is true .
If there are jars given through the SparkContext constructor, they are added using addJar .
Same for files using addFile .
At this point in time, the amount of memory to allocate to each executor (as
_executorMemory ) is calculated. It is the value of spark.executor.memory setting, or
CoarseMesosSchedulerBackend.
Caution FIXME
What's _executorMemory ?
What's the unit of the value of _executorMemory exactly?
What are "SPARK_TESTING", "spark.testing"? How do they contribute to executorEnvs ?
What's executorEnvs ?
The internal fields, _applicationId and _applicationAttemptId , are set. Application and
attempt ids are specific to the implementation of Task Scheduler.
The setting spark.app.id is set to _applicationId and the Web UI gets notified about the new value (using setAppId(_applicationId) ), and so does the Block Manager (using initialize(_applicationId) ).
scala> sc.getConf.get("spark.app.id")
res1: String = local-1447834845413
Caution FIXME Why should UI and Block Manager know about the application id?
Caution FIXME Why does Metric System need the application id?
The driver’s metrics (servlet handler) are attached to the web ui after the metrics system is
started.
true ).
Caution FIXME It'd be quite useful to have all the properties with their default values in sc.getConf.toDebugString , so when a configuration is not included but does change Spark runtime configuration, it should be added to _conf .
setupAndStartListenerBus registers user-defined listeners and starts the Listener Bus. An environment update is then posted on the Listener Bus with information about Task Scheduler's scheduling mode, added jar and file paths, and other environmental details. They are displayed in Web UI's Environment tab.
BlockManagerSource
Hadoop Configuration
While a SparkContext is created, so is a Hadoop configuration (as an instance of
org.apache.hadoop.conf.Configuration that is available as _hadoopConfiguration ).
of AWS_ACCESS_KEY_ID
Every spark.hadoop. setting becomes a setting of the configuration with the prefix
spark.hadoop. removed for the key.
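A small sketch of the mechanism (the Hadoop property below is just an example, and the sketch assumes a fresh application rather than the Spark shell):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("hadoop-conf-demo")
  .set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", "134217728")

val sc = new SparkContext(conf)
// The spark.hadoop. prefix is stripped in the Hadoop Configuration.
sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.split.minsize")   // "134217728"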
Environment Variables
SPARK_EXECUTOR_MEMORY sets the amount of memory to allocate to each executor. See
Executor Memory.
With RDDs, the creators of Spark managed to hide data partitioning, and thus distribution, which in turn allowed them to design a parallel computational framework with a higher-level programming interface (API) for four mainstream programming languages.
Figure 1. RDDs
From the scaladoc of org.apache.spark.rdd.RDD:
From the original paper about RDD - Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing:
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets
programmers perform in-memory computations on large clusters in a fault-tolerant
manner.
Besides the above traits (that are directly embedded in the name of the data abstraction - RDD) it has the following additional traits:
Lazy evaluated, i.e. the data inside RDD is not available or transformed until an action
is executed that triggers the execution.
Cacheable, i.e. you can hold all the data in a persistent "storage" like memory (default
and the most preferred) or disk (the least preferred due to access speed).
RDDs are distributed by design and, to achieve even data distribution as well as leverage data locality (in distributed systems like HDFS or Cassandra in which data is partitioned by default), they are partitioned into a fixed number of partitions - logical chunks (parts) of data. The logical division is for processing only; internally the data is not divided whatsoever. Each partition comprises records.
Figure 2. RDDs
Partitions are the units of parallelism. You can control the number of partitions of a RDD using the repartition or coalesce operations. Spark tries to be as close to data as possible without wasting time sending data across the network by means of RDD shuffling, and creates as many partitions as required to follow the storage layout and thus optimize data access. This leads to a one-to-one mapping between (physical) data in distributed data storage, e.g. HDFS or Cassandra, and partitions.
The motivation to create RDD was (according to the authors) two types of applications that current computing frameworks handle inefficiently:
An optional partitioner that defines how keys are hashed, and the pairs partitioned (for
key-value RDDs)
Optional preferred locations, i.e. hosts for a partition where the data will have been
loaded.
This RDD abstraction supports an expressive set of operations without having to modify
scheduler for each one.
An RDD is a named (by name ) and uniquely identified (by id ) entity inside a SparkContext. It lives in a SparkContext and, as a SparkContext creates a logical boundary, RDDs can't be shared between SparkContexts (see SparkContext and RDDs).
An RDD can optionally have a friendly name accessible using name that can be changed with an assignment ( name = ... ), as in the transcript below:
scala> ns.id
res0: Int = 2
scala> ns.name
res1: String = null
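scala> ns.name = "Friendly name"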
scala> ns.name
res2: String = Friendly name
scala> ns.toDebugString
res3: String = (8) Friendly name ParallelCollectionRDD[2] at parallelize at <console>:24 []
RDDs are a container of instructions on how to materialize big (arrays of) distributed data,
and how to split it into partitions so Spark (using executors) can hold some of them.
In general, data distribution can help executing processing in parallel so a task processes a
chunk of data that it could eventually keep in memory.
Spark does jobs in parallel, and RDDs are split into partitions to be processed and written in
parallel. Inside a partition, data is processed sequentially.
Saving partitions results in part-files instead of one single file (unless there is a single
partition).
Types of RDDs
These are some of the most interesting types of RDDs:
ParallelCollectionRDD
CoGroupedRDD
HadoopRDD is an RDD that provides core functionality for reading data stored in HDFS
using the older MapReduce API. The most notable use case is the return RDD of
SparkContext.textFile .
SequenceFile .
Appropriate operations of a given RDD type are automatically available on a RDD of the
right type, e.g. RDD[(Int, Int)] , through implicit conversion in Scala.
Transformations
A transformation is a lazy operation on a RDD that returns another RDD, like map ,
flatMap , filter , reduceByKey , join , cogroup , etc.
Actions
An action is an operation that triggers execution of RDD transformations and returns a value
(to a Spark driver - the user program).
Creating RDDs
SparkContext.parallelize
One way to create a RDD is with the SparkContext.parallelize method. It accepts a collection of elements as shown below ( sc is a SparkContext instance):
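val ints = sc.parallelize(1 to 100)   // RDD[Int] split into the default number of partitions
ints.count                            // 100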
Given that the reason to use Spark is to process more data than your own laptop could handle, SparkContext.parallelize is mainly used to learn Spark in the Spark shell.
SparkContext.makeRDD
SparkContext.textFile
One of the easiest ways to create an RDD is to use SparkContext.textFile to read files.
You can use the local README.md file (and then map it over to have an RDD of sequences
of words):
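val words = sc.textFile("README.md")
  .map(_.split("\\W+").toSeq)   // an RDD of sequences of words, one per line
words.cache()                   // see the note below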
Note You cache() it so the computation is not performed every time you work with words .
Transformations
RDD transformations by definition transform an RDD into another RDD and hence are the way to create new ones.
RDDs in Web UI
It is quite informative to look at RDDs in the Web UI that is at http://localhost:4040 for Spark
shell.
Execute the following Spark application (type all the lines in spark-shell ):
With the above executed, you should see the following in the Web UI:
ints.repartition(2).count
values.
It has to be implemented by any type of RDD in Spark and is called unless RDD is
checkpointed (and the result can be read from a checkpoint).
When an RDD is cached, for specified storage levels (i.e. all but NONE ) CacheManager is
requested to get or compute partitions.
Preferred Locations
A preferred location (aka locality preferences or placement preferences) is a block location
for an HDFS file where to compute each partition on.
Note The following diagram uses cartesian or zip for learning purposes only. You may use other operators to build a RDD graph.
A RDD lineage graph is hence a graph of what transformations need to be executed after an
action has been called.
You can learn about a RDD lineage graph using RDD.toDebugString method.
toDebugString
You can learn about a RDD lineage graph using RDD.toDebugString method.
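The exact code that built wordsCount is not shown here; a sketch that produces a lineage like the one below (assuming a local README.md):

val wordsCount = sc.textFile("README.md")
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)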
scala> wordsCount.toDebugString
res2: String =
(2) ShuffledRDD[24] at reduceByKey at <console>:24 []
+-(2) MapPartitionsRDD[23] at map at <console>:24 []
| MapPartitionsRDD[22] at flatMap at <console>:24 []
| MapPartitionsRDD[21] at textFile at <console>:24 []
| README.md HadoopRDD[20] at textFile at <console>:24 []
spark.logLineage
Enable spark.logLineage (default: false ) to see a RDD lineage graph using the RDD.toDebugString method every time an action on a RDD is called.
$ ./bin/spark-shell -c spark.logLineage=true
Execution Plan
Execution Plan starts with the earliest RDDs (those with no dependencies on other RDDs or that reference cached data) and ends with the RDD that produces the result of the action that has been called to execute.
Transformations
Transformations are lazy operations on a RDD that return RDD objects or collections of
RDDs, e.g. map , filter , reduceByKey , join , cogroup , randomSplit , etc.
Transformations are lazy and are not executed immediately, but only after an action has
been executed.
narrow transformations
wide transformations
Narrow Transformations
Narrow transformations are the result of map , filter and similar operations where the data comes from a single partition only, i.e. it is self-sustained.
An output RDD has partitions with records that originate from a single partition in the parent RDD. Only a limited subset of partitions is used to calculate the result.
Wide Transformations
Wide transformations are the result of groupByKey and reduceByKey . The data required to
compute the records in a single partition may reside in many partitions of the parent RDD.
All of the tuples with the same key must end up in the same partition, processed by the same task. To satisfy these operations, Spark must execute an RDD shuffle, which transfers data across the cluster and results in a new stage with a new set of partitions.
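A sketch that shows both kinds in one lineage (the data is made up):

val ints = sc.parallelize(1 to 10, 4)
val mapped = ints.map(n => (n % 3, n))   // narrow: each output partition comes from one input partition
val grouped = mapped.groupByKey()        // wide: values for a key may live in many parent partitions

grouped.toDebugString                    // the indented +- branch marks the shuffle (stage) boundary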
Actions
Actions are operations that return values, i.e. any RDD operation that returns a value of any
type but RDD[T] is an action.
They trigger execution of RDD transformations to return values. Simply put, an action
evaluates the RDD lineage graph.
You can think of actions as a valve: until an action is fired, the data to be processed is not even in the pipes, i.e. transformations. Only actions can materialize the entire processing pipeline with real data.
Actions in org.apache.spark.rdd.RDD:
aggregate
collect
count
countApprox*
countByValue*
first
fold
foreach
foreachPartition
max
min
reduce
take
takeOrdered
takeSample
toLocalIterator
top
treeAggregate
treeReduce
Tip You should cache an RDD you work with when you want to execute two or more actions on it for better performance. Refer to RDD Caching / Persistence.
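For example (a sketch assuming a local README.md):

val lines = sc.textFile("README.md").cache()   // mark for caching before the first action

lines.count   // the first action reads the file and populates the cache
lines.first   // subsequent actions reuse the cached partitions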
AsyncRDDActions
AsyncRDDActions class offers asynchronous actions that you can use on RDDs (thanks to an implicit conversion in Scala); see the sketch after this list:
countAsync
collectAsync
takeAsync
foreachAsync
foreachPartitionAsync
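A sketch of countAsync in use (the asynchronous actions return a FutureAction that can be waited on like a regular Scala Future):

import scala.concurrent.Await
import scala.concurrent.duration._

val nums = sc.parallelize(1 to 1000)
val futureCount = nums.countAsync()                // submits the job and returns immediately
val count = Await.result(futureCount, 10.seconds)  // 1000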
FutureActions
Caution FIXME
See https://issues.apache.org/jira/browse/SPARK-5063
mapPartitions Operator
Caution FIXME
Caution FIXME
1. How does the number of partitions map to the number of tasks? How to verify it?
2. How does the mapping between partitions and tasks correspond to data locality if any?
Spark manages data using partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors.
By default, Spark tries to read data into an RDD from the nodes that are close to it. Since
Spark usually accesses distributed partitioned data, to optimize transformation operations it
creates partitions to hold the data chunks.
There is a one-to-one correspondence between Spark partitions and how data is laid out in data storage like HDFS or Cassandra (which is partitioned for the same reasons).
Features:
size
number
partitioning scheme
node distribution
repartitioning
Read the following documentations to learn what experts say on the topic:
By default, a partition is created for each HDFS partition, which by default is 64MB (from
Spark’s Programming Guide).
RDDs get partitioned automatically without programmer intervention. However, there are
times when you’d like to adjust the size and number of partitions or the partitioning scheme
according to the needs of your application.
You use def getPartitions: Array[Partition] method on a RDD to know the set of
partitions in this RDD.
When a stage executes, you can see the number of partitions for a given stage in the
Spark UI.
When you execute the Spark job, i.e. sc.parallelize(1 to 100).count , you should see the
following in Spark shell application UI.
$ sysctl -n hw.ncpu
8
You can request the minimum number of partitions using the second input parameter to many transformations.
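For example, the ints RDD queried below could have been created with an explicit partition count (an assumption, since the defining line is not shown here):

val ints = sc.parallelize(1 to 100, 4)   // ask for 4 partitions explicitly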
scala> ints.partitions.size
res2: Int = 4
Increasing the partition count will make each partition have less data (or none at all!)
Spark can only run 1 concurrent task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to have at least 50 partitions (and probably 2-3x that).
As far as choosing a "good" number of partitions, you generally want at least as many as the
number of executors for parallelism. You can get this computed value by calling
sc.defaultParallelism .
Also, the number of partitions determines how many files get generated by actions that save
RDDs to files.
The maximum size of a partition is ultimately limited by the available memory of an executor.
In the first RDD transformation, e.g. reading from a file using sc.textFile(path, partition) ,
the partition parameter will be applied to all further transformations and actions on this
RDD.
Partitions get redistributed among nodes whenever a shuffle occurs. Repartitioning may cause a shuffle to occur in some situations, but it is not guaranteed to occur in all cases. It usually happens during the action stage.
When reading a file with textFile, Spark creates as many partitions as the number of blocks you see in HDFS, but if the lines in your file are too long (longer than the block size), there will be fewer partitions.
The preferred way to set up the number of partitions for an RDD is to pass it directly as the second input parameter in the call, like rdd = sc.textFile("hdfs://…/file.txt", 400) , where 400 is the number of partitions. In this case, the partitioning makes for 400 splits that are done by Hadoop's TextInputFormat , not Spark, and it works much faster. The code also spawns 400 concurrent tasks to try to load file.txt directly into 400 partitions.
When using textFile with compressed files ( file.txt.gz not file.txt or similar), Spark
disables splitting that makes for an RDD with only 1 partition (as reads against gzipped files
cannot be parallelized). In this case, to change the number of partitions you should do
repartitioning.
Repartitioning
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null) does coalesce with the shuffle flag enabled (see coalesce below).
With the following computation you can see that repartition(5) causes 5 tasks to be
started using NODE_LOCAL data locality.
scala> lines.repartition(5).count
...
15/10/07 08:10:00 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 7 (MapPartitionsRDD[
15/10/07 08:10:00 INFO TaskSchedulerImpl: Adding task set 7.0 with 5 tasks
15/10/07 08:10:00 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 17, localhost, partition 0
15/10/07 08:10:00 INFO TaskSetManager: Starting task 1.0 in stage 7.0 (TID 18, localhost, partition 1
15/10/07 08:10:00 INFO TaskSetManager: Starting task 2.0 in stage 7.0 (TID 19, localhost, partition 2
15/10/07 08:10:00 INFO TaskSetManager: Starting task 3.0 in stage 7.0 (TID 20, localhost, partition 3
15/10/07 08:10:00 INFO TaskSetManager: Starting task 4.0 in stage 7.0 (TID 21, localhost, partition 4
...
You can see a change after executing repartition(1) : it causes 2 tasks to be started using PROCESS_LOCAL data locality.
scala> lines.repartition(1).count
...
15/10/07 08:14:09 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 8 (MapPartitions
15/10/07 08:14:09 INFO TaskSchedulerImpl: Adding task set 8.0 with 2 tasks
15/10/07 08:14:09 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 22, localhost, partition 0
15/10/07 08:14:09 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 23, localhost, partition 1
...
Please note that Spark disables splitting for compressed files and creates RDDs with only 1
partition. In such cases, it’s helpful to use sc.textFile('demo.gz') and do repartitioning
using rdd.repartition(100) as follows:
rdd = sc.textFile('demo.gz')
rdd = rdd.repartition(100)
With these lines, you end up with rdd having exactly 100 partitions of roughly equal size.
Tip If the partitioning scheme doesn't work for you, you can write your own custom partitioner.
coalesce transformation
The coalesce transformation is used to change the number of partitions. It can trigger RDD
shuffling depending on the second shuffle boolean input parameter (defaults to false ).
In the following sample, you parallelize a local 10-number sequence and coalesce it first
without and then with shuffling (note the shuffle parameter being false and true ,
respectively). You use toDebugString to check out the RDD’s lineage graph.
scala> rdd.partitions.size
res0: Int = 8
scala> res1.toDebugString
res2: String =
(8) CoalescedRDD[1] at coalesce at <console>:27 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
scala> res3.toDebugString
res4: String =
(8) MapPartitionsRDD[5] at coalesce at <console>:27 []
| CoalescedRDD[4] at coalesce at <console>:27 []
| ShuffledRDD[3] at coalesce at <console>:27 []
+-(8) MapPartitionsRDD[2] at coalesce at <console>:27 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
1. shuffle is false by default and it's explicitly used here for demo purposes. Note that the number of partitions remains the same as the number of partitions in the source RDD rdd .
Partitioner
Caution FIXME
A partitioner captures data distribution at the output. A scheduler can optimize future
operations based on this.
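A small sketch of assigning a partitioner explicitly (the data is made up):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(4))   // hash keys into 4 partitions

partitioned.partitioner   // Some(org.apache.spark.HashPartitioner@...)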
HashPartitioner
Caution FIXME
HashPartitioner is the default partitioner for the coalesce operation when shuffle is allowed.
The cache() operation is a synonym of persist() that uses the default storage level
MEMORY_ONLY .
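For example (a sketch; note that a storage level can be assigned to an RDD only once):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("README.md")
lines.cache()   // equivalent to lines.persist(StorageLevel.MEMORY_ONLY)

val other = sc.textFile("README.md").persist(StorageLevel.MEMORY_AND_DISK)   // explicit level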
Storage Levels
StorageLevel describes how an RDD is persisted (and addresses the following concerns):
There are the following StorageLevels (the _2 suffix in the name denotes 2 replicas):
NONE (default)
DISK_ONLY
DISK_ONLY_2
MEMORY_ONLY
MEMORY_ONLY_2
MEMORY_ONLY_SER
MEMORY_ONLY_SER_2
MEMORY_AND_DISK
MEMORY_AND_DISK_2
MEMORY_AND_DISK_SER
MEMORY_AND_DISK_SER_2
OFF_HEAP
You can check out the storage level using getStorageLevel() operation.
scala> lines.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)
RDD shuffling
Tip Read the official documentation about the topic Shuffle operations. It is still better than this page.
Shuffling is a process of repartitioning (redistributing) data across partitions, which may cause moving it across JVMs or even over the network when it is redistributed among executors.
Tip Avoid shuffling at all cost. Think about ways to leverage existing partitions. Leverage partial aggregation to reduce data transfer.
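For example, reduceByKey performs partial aggregation within each partition before shuffling, while groupByKey ships every value across the network (a sketch with made-up data):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

val viaGroup  = pairs.groupByKey().mapValues(_.sum)   // shuffles every (key, value) pair
val viaReduce = pairs.reduceByKey(_ + _)              // combines values within each partition first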
By default, shuffling doesn’t change the number of partitions, but their content.
data.
Example - join
PairRDD offers join transformation that (quoting the official documentation):
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs
with all pairs of elements for each key.
Let’s have a look at an example and see how it works under the covers:
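The datasets used to build joined are not shown here; a sketch that produces a lineage like the one below:

val kv = sc.parallelize(Seq((1, "one"), (2, "two"), (3, "three")))   // (K, V)
val kw = sc.parallelize(Seq((1, 1.0), (2, 2.0), (4, 4.0)))           // (K, W)

val joined = kv join kw   // RDD[(Int, (String, Double))]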
scala> joined.toDebugString
res7: String =
(8) MapPartitionsRDD[10] at join at <console>:32 []
| MapPartitionsRDD[9] at join at <console>:32 []
| CoGroupedRDD[8] at join at <console>:32 []
+-(8) ParallelCollectionRDD[3] at parallelize at <console>:26 []
+-(8) ParallelCollectionRDD[4] at parallelize at <console>:26 []
It doesn’t look good when there is an "angle" between "nodes" in an operation graph. It
appears before the join operation so shuffle is expected.
The join operation is one of the cogroup operations that use defaultPartitioner , i.e. it walks through the RDD lineage graph (sorted by the number of partitions, decreasing) and picks the partitioner with a positive number of output partitions. Otherwise, it checks the spark.default.parallelism setting and, if defined, picks HashPartitioner with the default parallelism as the number of partitions.
Checkpointing
Introduction
Checkpointing is a process of truncating an RDD lineage graph and saving it to a reliable distributed (HDFS) or local file system. There are two types of checkpointing:
reliable - in Spark (core), RDD checkpointing that saves the actual intermediate RDD data to a reliable distributed file system, e.g. HDFS.
local - in Spark Streaming or GraphX - RDD checkpointing that truncates the RDD lineage graph.
It’s up to a Spark application developer to decide when and how to checkpoint using
RDD.checkpoint() method.
Before checkpointing is used, a Spark developer has to set the checkpoint directory using
SparkContext.setCheckpointDir(directory: String) method.
Reliable Checkpointing
You call SparkContext.setCheckpointDir(directory: String) to set the checkpoint directory
- the directory where RDDs are checkpointed. The directory must be a HDFS path if
running on a cluster. The reason is that the driver may attempt to reconstruct the
checkpointed RDD from its own local file system, which is incorrect because the checkpoint
files are actually on the executor machines.
You mark an RDD for checkpointing by calling RDD.checkpoint() . The RDD will be saved to
a file inside the checkpoint directory and all references to its parent RDDs will be removed.
This function has to be called before any job has been executed on this RDD.
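A minimal sketch (the checkpoint directory below is an example; use an HDFS path when running on a cluster):

sc.setCheckpointDir("/tmp/spark-checkpoints")

val nums = sc.parallelize(1 to 100)
nums.checkpoint()   // mark the RDD for checkpointing before any job runs on it
nums.count()        // the first action materializes the RDD and writes the checkpoint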
When an action is called on a checkpointed RDD, the following INFO message is printed out
in the logs:
ReliableRDDCheckpointData
When the RDD.checkpoint() operation is called, all the information related to RDD checkpointing is in ReliableRDDCheckpointData .
ReliableCheckpointRDD
After RDD.checkpoint the RDD has ReliableCheckpointRDD as the new parent with the same number of partitions as the RDD.
Local Checkpointing
Besides the RDD.checkpoint() method, there is a similar one - RDD.localCheckpoint() - that marks the RDD for local checkpointing using Spark's existing caching layer.
This RDD.localCheckpoint() method is for users who wish to truncate RDD lineage graph
while skipping the expensive step of replicating the materialized data in a reliable distributed
file system. This is useful for RDDs with long lineages that need to be truncated periodically,
e.g. GraphX.
LocalRDDCheckpointData
FIXME
LocalCheckpointRDD
FIXME
Dependencies
Dependency (represented by Dependency class) is a connection between RDDs after
applying a transformation.
You can use RDD.dependencies method to know the collection of dependencies of a RDD
( Seq[Dependency[_]] ).
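The r4 RDD queried below was most likely built by unioning a few parallelized collections; a sketch:

val r1 = sc.parallelize(0 to 9)
val r2 = sc.parallelize(0 to 9)
val r3 = sc.parallelize(0 to 9)

val r4 = sc.union(r1, r2, r3)   // UnionRDD with a RangeDependency on each parent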
scala> r4.dependencies
res0: Seq[org.apache.spark.Dependency[_]] = ArrayBuffer(org.apache.spark.RangeDependency@6f2ab3f6, or
scala> r4.toDebugString
res1: String =
(24) UnionRDD[23] at union at <console>:24 []
| ParallelCollectionRDD[20] at parallelize at <console>:18 []
| ParallelCollectionRDD[21] at parallelize at <console>:18 []
| ParallelCollectionRDD[22] at parallelize at <console>:18 []
scala> r4.collect
...
res2: Array[Int] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5
Kinds of Dependencies
Dependency is the base abstract class with a single def rdd: RDD[T] method.
scala> r.dependencies.map(_.rdd).foreach(println)
MapPartitionsRDD[11] at groupBy at <console>:18
NarrowDependency
OneToOneDependency
PruneDependency
RangeDependency
ShuffleDependency
ShuffleDependency
A ShuffleDependency represents a dependency on the output of a shuffle map stage.
scala> r.dependencies
res0: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@493b0b09)
It uses partitioner to partition the shuffle output. It also uses ShuffleManager to register itself
(using ShuffleManager.registerShuffle) and ContextCleaner to register itself for cleanup
(using ContextCleaner.registerShuffleForCleanup ).
The RDD operations that may or may not use the above RDDs and hence shuffling:
coalesce
repartition
cogroup
intersection
subtractByKey
subtract
sortByKey
sortBy
repartitionAndSortWithinPartitions
combineByKeyWithClassTag
combineByKey
aggregateByKey
foldByKey
reduceByKey
countApproxDistinctByKey
groupByKey
partitionBy
Note There may be other dependent methods that use the above.
NarrowDependency
NarrowDependency is an abstract extension of Dependency with a narrow (limited) number of partitions of the parent RDD that are required to compute a partition of the child RDD. Narrow dependencies allow for pipelined execution.
It defines the getParents(partitionId: Int): Seq[Int] method to get the parent partitions for a partition partitionId of the child RDD.
OneToOneDependency
OneToOneDependency is a narrow dependency that represents a one-to-one dependency between partitions of the parent and child RDDs.
scala> r3.dependencies
res32: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.OneToOneDependency@7353a0fb)
scala> r3.toDebugString
res33: String =
(8) MapPartitionsRDD[19] at map at <console>:20 []
| ParallelCollectionRDD[13] at parallelize at <console>:18 []
PruneDependency
PruneDependency is a narrow dependency that represents a dependency between a PartitionPruningRDD and its parent RDD, where the child RDD contains a subset of the partitions of the parent.
RangeDependency
RangeDependency is a narrow dependency that represents a one-to-one dependency between ranges of partitions in the parent and child RDDs, as used by UnionRDD :
scala> unioned.dependencies
res19: Seq[org.apache.spark.Dependency[_]] = ArrayBuffer(org.apache.spark.RangeDependency@28408ad7, o
scala> unioned.toDebugString
res18: String =
(16) UnionRDD[16] at union at <console>:22 []
| ParallelCollectionRDD[13] at parallelize at <console>:18 []
| ParallelCollectionRDD[14] at parallelize at <console>:18 []
ParallelCollectionRDD
ParallelCollectionRDD is an RDD of a collection of elements with numSlices partitions and optional locationPrefs . It is the result of the SparkContext.parallelize and SparkContext.makeRDD methods.
It uses ParallelCollectionPartition .
MapPartitionsRDD
MapPartitionsRDD is an RDD that applies the provided function f to every partition of the parent RDD. It is the result of the following transformations and methods:
RDD.map
RDD.flatMap
RDD.filter
RDD.glom
RDD.mapPartitions
RDD.mapPartitionsWithIndex
PairRDDFunctions.mapValues
PairRDDFunctions.flatMapValues
PairRDDFunctions
Tip Read up the scaladoc of PairRDDFunctions.
PairRDDFunctions are available in RDDs of key-value pairs via Scala’s implicit conversion.
Tip Partitioning is an advanced feature that is directly linked to (or inferred by) the use of PairRDDFunctions . Read up about it in Partitions and Partitioning.
It may often not be important to have a given number of partitions upfront (at RDD creation
time upon loading data from data sources), so only "regrouping" the data by key after it is an
RDD might be…the key (pun not intended).
You can use groupByKey or another PairRDDFunctions method to have a key in one
processing flow.
You can use partitionBy , which is available for RDDs of tuples, i.e. PairRDD :
rdd.keyBy(_.kind)
.partitionBy(new HashPartitioner(PARTITIONS))
.foreachPartition(...)
Think of situations where kind has low cardinality or a highly skewed distribution; in those cases, using this technique for partitioning might not be an optimal solution.
rdd.keyBy(_.kind).reduceByKey(....)
mapValues, flatMapValues
Caution FIXME
combineByKeyWithClassTag
It uses mapSideCombine enabled (i.e. true ) by default. It then creates a ShuffledRDD with the value of mapSideCombine when the input partitioner is different from the current one in an RDD.
CoGroupedRDD
A RDD that cogroups its pair RDD parents. For each key k in parent RDDs, the resulting
RDD contains a tuple with the list of values for that key.
HadoopRDD
HadoopRDD is an RDD that provides core functionality for reading data stored in HDFS, a
local file system (available on all nodes), or any Hadoop-supported file system URI using the
older MapReduce API (org.apache.hadoop.mapred).
A HadoopRDD is created as a result of calling the following SparkContext methods:
hadoopFile
sequenceFile
When an HadoopRDD is computed, i.e. an action is called, you should see the INFO
message Input split: in the logs.
scala> sc.textFile("README.md").count
...
15/10/10 18:03:21 INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/README.md:0+1784
15/10/10 18:03:21 INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/README.md:1784+1784
...
The following properties are set upon partition execution:
mapred.task.is.map as true
mapred.task.partition - split id
mapred.job.id
FIXME
getPartitions
The number of partitions for a HadoopRDD, i.e. the return value of getPartitions , is calculated using InputFormat.getSplits(jobConf, minPartitions) where minPartitions is only a hint of how many partitions one may want at minimum. As a hint it does not mean the number of partitions will be exactly the number given.
FileInputFormat is the base class for all file-based InputFormats. This provides a
generic implementation of getSplits(JobConf, int). Subclasses of FileInputFormat can
also override the isSplitable(FileSystem, Path) method to ensure input-files are not
split-up and are processed as a whole by Mappers.
ShuffledRDD
ShuffledRDD is an RDD of (key, value) pairs. It is a shuffle step (the result RDD) for transformations that trigger a shuffle at execution. Such transformations ultimately call the coalesce transformation with the shuffle input parameter true (default: false ).
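The r RDD in the transcript below could have been built like this (a sketch):

val r = sc.parallelize(0 to 9, 3).groupBy(_ % 3)   // groupBy introduces a ShuffledRDD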
scala> r.toDebugString
res0: String =
(3) ShuffledRDD[2] at groupBy at <console>:18 []
+-(3) MapPartitionsRDD[1] at groupBy at <console>:18 []
| ParallelCollectionRDD[0] at parallelize at <console>:18 []
As you may have noticed, the groupBy transformation adds a ShuffledRDD that will execute shuffling at execution time (as depicted in the following screenshot).
repartitionAndSortWithinPartitions
sortByKey (be very careful due to [SPARK-1021] sortByKey() launches a cluster job
when it shouldn’t)
partitionBy (only when the input partitioner is different from the current one in an
RDD)
It uses Partitioner.
BlockRDD
Caution FIXME
Spark shell
Spark shell is an interactive shell for learning about Apache Spark, for ad-hoc queries and for developing Spark applications. It is a very convenient tool to explore the many things available in Spark and one of the many reasons why Spark is so helpful even for very simple tasks (see Why Spark).
There are variants of the Spark shell for different languages: spark-shell for Scala and pyspark for Python.
Spark shell boils down to executing Spark submit and so command-line arguments of Spark
submit become Spark shell’s, e.g. --verbose .
$ ./bin/spark-shell
Spark context available as sc.
SQL context available as sqlContext.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0-SNAPSHOT
/_/
Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Spark shell gives you the sc value which is the SparkContext for the session.
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@2ac0cb64
scala> sqlContext
res1: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@60ae950f
To close Spark shell, you press Ctrl+D or type in :q (or any subset of :quit ).
scala> :quit
While trying it out using an incorrect value for the master’s URL, you’re told about --help
and --verbose options.
Tip These nulls could instead be replaced with some other, more meaningful values.
The web UI comes with the following tabs:
Jobs
Stages
Environment
Executors
SQL
You can view the web UI after the fact by setting spark.eventLog.enabled to true before starting the application.
Environment Tab
JobProgressListener
JobProgressListener is the listener for Web UI.
Settings
spark.ui.enabled (default: true ) setting controls whether the web UI is started at all.
If multiple SparkContexts attempt to run on the same host (it is not possible to have two
or more Spark contexts on a single JVM, though), they will bind to successive ports
beginning with spark.ui.port .
spark.ui.killEnabled (default: true ) - whether or not you can kill stages in web UI.
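A sketch of setting the UI-related properties programmatically (the values are examples only):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("webui-settings-demo")
  .set("spark.ui.enabled", "true")       // start the web UI
  .set("spark.ui.port", "4040")          // first port to try
  .set("spark.ui.killEnabled", "false")  // disallow killing stages from the UI
  .set("spark.eventLog.enabled", "true") // allow viewing the UI after the fact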
Executors Tab
Caution FIXME
spark-submit script
You use the spark-submit script to launch a Spark application, i.e. submit the application to a
Spark deployment environment.
You can find the spark-submit script in the bin directory of the Spark distribution.
Deploy Modes
Using the --deploy-mode command-line option you can specify one of two deploy modes:
client (default)
cluster
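A sketch of a typical submission (the class and jar names are made up for illustration):
./bin/spark-submit \
  --class org.example.MyApp \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --executor-memory 2G \
  target/my-app_2.11-1.0.jar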
Command-line Options
Execute ./bin/spark-submit --help to know about the command-line options supported.
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
The spark-submit script recognizes the following command-line options:
--class
--conf or -c
--deploy-mode
--driver-class-path
--driver-java-options
--driver-library-path
--driver-memory
--executor-memory
--files
--jars
--master
--name
--packages
--exclude-packages
--properties-file
--proxy-user
--py-files
--repositories
--total-executor-cores
--help or -h
--usage-error
--verbose or -v
--version
YARN-only options:
--archives
--executor-cores
--keytab
--num-executors
--principal
--queue
Environment Variables
The following is the list of environment variables that are considered when command-line
options are not specified:
SPARK_EXECUTOR_CORES
DEPLOY_MODE
SPARK_YARN_APP_NAME
_SPARK_CMD_USAGE
./bin/spark-submit \
--packages my:awesome:package \
--repositories s3n://$aws_ak:$aws_sak@bucket/path/to/repo
SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell
Note: With SPARK_PRINT_LAUNCH_COMMAND set, the launcher prints out the final command before executing it.
At startup, the spark-class script loads additional environment settings (see the section on
spark-env.sh in this document).
spark-class then searches for the so-called Spark assembly jar ( spark-
assembly.hadoop..jar ) in SPARK_HOME/lib or SPARK_HOME/assembly/target/scala- and uses
org.apache.spark.launcher.Main to build the command (the same entry point that
spark-submit uses at the very beginning). The Main class programmatically computes the final
command to be executed.
OptionParser class
SparkSubmit itself
Example settings in conf/spark-env.sh:
export JAVA_HOME=/your/directory/java
export HADOOP_HOME=/usr/lib/hadoop
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=1G
spark-class
bin/spark-class shell script is the script launcher for internal Spark classes.
org.apache.spark.launcher.Main
org.apache.spark.launcher.Main is the command-line launcher used in Spark scripts, like
spark-class .
It builds the command line for a Spark class using environment variables; the calling script
(e.g. spark-class ) then executes the command.
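For example, a sketch of launching an internal Spark class (the standalone Master) through spark-class and printing the computed command:
SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-class org.apache.spark.deploy.master.Master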
Spark Architecture
Spark uses a master/worker architecture. There is a driver that talks to a single
coordinator called master that manages workers in which executors run.
Driver
A Spark driver is the process that creates and owns an instance of SparkContext. It is your
Spark application that launches the main method in which the instance of SparkContext is
created. It is the cockpit of jobs and tasks execution (using DAGScheduler and Task
Scheduler). It hosts Web UI for the environment.
A driver is where the task scheduler lives and spawns tasks across workers.
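A minimal driver program sketch (the application and class names are made up); the process that runs main and creates the SparkContext below is the driver:
import org.apache.spark.{SparkConf, SparkContext}

object MyDriverApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("my-driver-app")
    val sc = new SparkContext(conf)       // the driver owns the SparkContext
    println(sc.parallelize(1 to 100).sum) // jobs and tasks are coordinated from here
    sc.stop()
  }
}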
The driver requires additional services (besides the common ones like ShuffleManager,
MemoryManager, BlockTransferService, BroadcastManager, CacheManager):
Listener Bus
driverActorSystemName
HttpFileServer
Master
A master is a running Spark instance that connects to a cluster manager for resources.
Workers
Workers (aka slaves) are running Spark instances where executors live to execute tasks.
They are the compute nodes in Spark.
Each worker hosts a local Block Manager that serves blocks to other workers in a Spark cluster.
Workers communicate among themselves using their Block Manager instances.
The following explains task execution in Spark and Spark’s underlying execution model.
When you create SparkContext, each worker starts an executor. This is a separate process
(JVM), and it loads your jar, too. The executors connect back to your driver program. Now
the driver can send them commands, like flatMap , map and reduceByKey . When the
driver quits, the executors shut down.
A new process is not started for each step. A new process is started on each worker when
the SparkContext is constructed.
The executor deserializes the command (this is possible because it has loaded your jar),
and executes it on a partition.
1. Create RDD graph, i.e. DAG (directed acyclic graph) of RDDs to represent entire
computation.
2. Create stage graph, i.e. a DAG of stages that is a logical execution plan based on the
RDD graph. Stages are created by breaking the RDD graph at shuffle boundaries.
Based on this graph, two stages are created. The stage creation rule is based on the idea of
pipelining as many narrow transformations as possible. RDD operations with "narrow"
dependencies, like map() and filter() , are pipelined together into one set of tasks in
each stage.
In the end, every stage will only have shuffle dependencies on other stages, and may
compute multiple operations inside it.
In the WordCount example, the narrow transformations finish at the per-word count. Therefore,
you get two stages, as sketched below.
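A word-count sketch (the input path is made up); the shuffle introduced by reduceByKey marks the boundary between the two stages:
val lines = sc.textFile("README.md")
val counts = lines.flatMap(_.split("\\s+")) // stage 1: narrow transformations, pipelined
  .map(word => (word, 1))
  .reduceByKey(_ + _)                       // shuffle boundary: stage 2 starts here
counts.collect()                            // the action that submits the job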
Once stages are defined, Spark will generate tasks from stages. The first stage will create a
series of ShuffleMapTask and the last stage will create ResultTasks because in the last
stage, one action operation is included to produce results.
The number of tasks to be generated depends on how your files are distributed. Suppose
that you have three different files on three different nodes; the first stage will then generate 3
tasks: one task per partition.
Therefore, you should not map your steps to tasks directly. A task belongs to a stage, and is
related to a partition.
The number of tasks being generated in each stage will be equal to the number of partitions.
Cleanup
Caution FIXME
Settings
spark.worker.cleanup.enabled (default: false ) - whether to enable periodic cleanup of worker and application directories.
Executors
Executors are distributed agents responsible for executing tasks. They typically (i.e. not
always) run for the entire lifetime of a Spark application. Executors send active task metrics
to the driver. They also inform executor backends about task status updates (including task
results).
Executors provide in-memory storage for RDDs that are cached in Spark applications (via
Block Manager).
When executors are started they register themselves with the driver and communicate
directly to execute tasks.
Executor offers are described by executor id and the host on which an executor runs (see
Resource Offers in this document).
Executors use a thread pool for sending metrics and launching tasks (by means of
TaskRunner).
Each executor can run multiple tasks over its lifetime, both in parallel and sequentially.
It is recommended to have as many executors as data nodes and as many cores as you can
get from the cluster.
Executors are described by their id, hostname, environment (as SparkEnv ), and
classpath (and, less importantly, and more for internal optimization, whether they run in
local or cluster mode).
Refer to Logging.
When an executor is started you should see INFO messages about it in the logs.
(only for non-local mode) Executors initialize the application using BlockManager.initialize().
Executors maintain a mapping between task ids and running tasks (as instances of
TaskRunner) in runningTasks internal map. Consult Launching Tasks section.
A worker requires the additional services (beside the common ones like …):
executorActorSystemName
MapOutputTrackerWorker
Coarse-Grained Executors
Coarse-grained executors are executors that use CoarseGrainedExecutorBackend for task
scheduling.
Launching Tasks
Executor.launchTask creates a TaskRunner that is then executed on Executor task launch
Caution FIXME
What’s in taskRunner.task.metrics ?
What’s in Heartbeat ? Why is blockManagerId sent?
What’s in RpcUtils.makeDriverRef ?
TaskRunner
TaskRunner is a thread of execution that runs a task. It requires a ExecutorBackend (to
send status updates to), task and attempt ids, task name, and serialized version of the task
(as ByteBuffer ). It sends updates about task execution to the ExecutorBackend.
When a TaskRunner starts running, it prints the following INFO message to the logs:
taskId is the id of the task being executed in Executor task launch worker-[taskId].
TaskRunner can run a single task only. When TaskRunner finishes, it is removed from the
internal runningTasks map.
The serialized task is deserialized into a Task instance (using the globally-configured Serializer ). The Task instance has the
TaskMemoryManager set.
This is the moment when a task can stop its execution if it was killed while being
deserialized. If not killed, TaskRunner continues executing the task.
Task runs (with taskId , attemptNumber , and the globally-configured MetricsSystem ). See
Task Execution.
The result value is serialized (using the other instance of Serializer , i.e. serializer -
there are two Serializer instances in SparkContext.env ).
If maxResultSize is set and the size of the serialized result exceeds the value, a
SparkException is reported.
scala> sc.getConf.get("spark.driver.maxResultSize")
res5: String = 1m
INFO Executor: Finished [taskName] (TID [taskId]). [resultSize] bytes result sent to driver
FetchFailedException
Caution FIXME
shuffleId
mapId
reduceId
TaskRunner catches it and informs ExecutorBackend about the case (using statusUpdate
with TaskState.FAILED task state).
Resource Offers
Read resourceOffers in TaskSchedulerImpl and resourceOffer in TaskSetManager.
You can change the assigned memory per executor per node in standalone cluster using
SPARK_EXECUTOR_MEMORY environment variable.
You can find the value displayed as Memory per Node in web UI for standalone Master (as
depicted in the figure below).
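A sketch of setting it for a standalone cluster (typically in conf/spark-env.sh on the workers):
# conf/spark-env.sh
export SPARK_EXECUTOR_MEMORY=2G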
Metrics
Executors use Metrics System (via ExecutorSource ) to report metrics about internal status.
Note Metrics are only available for cluster modes, i.e. local mode turns metrics off.
filesystem.file.read_bytes
filesystem.file.write_bytes
filesystem.file.read_ops
filesystem.file.largeRead_ops
filesystem.file.write_ops
Settings
spark.executor.cores - the number of cores for an executor
spark.executor.heartbeatInterval (default: 10s ) - the interval after which an executor
reports heartbeat and metrics for active tasks to the driver. Refer to Sending heartbeats
and partial metrics for active tasks.
spark.executor.id
Dynamic Allocation.
spark.executor.logs.rolling.maxSize
spark.executor.logs.rolling.maxRetainedFiles
spark.executor.logs.rolling.strategy
spark.executor.logs.rolling.time.interval
spark.executor.port
ClassLoader to load new classes defined in the Scala REPL as a user types code.
spark.akka.frameSize (default: 128 MB, maximum: 2047 MB) - the configured max
frame size for Akka messages. If a task result is bigger, executors use block manager to
send results back.
spark.driver.maxResultSize (default: 1g )
Spark Runtime Environment is represented by a SparkEnv object that holds all the required
services for a running Spark instance, i.e. a driver or an executor.
SparkEnv
SparkEnv holds all runtime objects for a running Spark instance, using
SparkEnv.createDriverEnv() for a driver and SparkEnv.createExecutorEnv() for an executor.
scala> SparkEnv.get
res0: org.apache.spark.SparkEnv = org.apache.spark.SparkEnv@2220c5f7
SparkEnv.create()
SparkEnv.create is a common initialization procedure to create a Spark execution environment
for either a driver or an executor.
It creates a NettyBlockTransferService.
It creates a BlockManagerMaster.
It creates a BlockManager.
It creates a BroadcastManager.
It creates a CacheManager.
It initializes the userFiles temporary directory that is used for downloading dependencies for a
driver, whereas for an executor it is the executor’s current working directory.
An OutputCommitCoordinator is created.
SparkEnv.createDriverEnv()
SparkEnv.createDriverEnv creates a driver’s (execution) environment, i.e. the Spark runtime
environment for a driver.
The method accepts an instance of SparkConf, whether it runs in local mode or not, an
instance of listener bus, the number of driver’s cores to use for execution in local mode or
0 otherwise, and an OutputCommitCoordinator (default: none).
For Akka-based RPC Environment (obsolete since Spark 1.6.0-SNAPSHOT), the name of
the actor system for the driver is sparkDriver. See clientMode how it is created in detail.
SparkEnv.createExecutorEnv()
SparkEnv.createExecutorEnv creates an executor’s (execution) environment, i.e. the Spark
runtime environment for an executor.
whether or not it runs in local mode.
For Akka-based RPC Environment (obsolete since Spark 1.6.0-SNAPSHOT), the name of
the actor system for an executor is sparkExecutor.
Settings
spark.driver.host - the name of the machine where the (active) driver runs.
spark.serializer - the Serializer.
spark.closure.serializer - the Serializer.
spark.shuffle.manager - one of: hash or org.apache.spark.shuffle.hash.HashShuffleManager,
sort or org.apache.spark.shuffle.sort.SortShuffleManager,
tungsten-sort or org.apache.spark.shuffle.sort.SortShuffleManager.
spark.memory.useLegacyMode - whether to use the legacy StaticMemoryManager ( true ) or
UnifiedMemoryManager ( false ).
DAGScheduler
Note: The introduction that follows was highly influenced by the scaladoc of
org.apache.spark.scheduler.DAGScheduler. As DAGScheduler is a private
class it does not appear in the official API documentation. You are strongly
encouraged to read the sources and only then read this and the related pages
afterwards.
"Reading the sources", I say?! Yes, I am kidding!
Introduction
DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented
scheduling, i.e. after an RDD action has been called it becomes a job that is then
transformed into a set of stages that are submitted as TaskSets for execution (see Execution
Model).
It computes a directed acyclic graph (DAG) of stages for each job, keeps track of which
RDDs and stage outputs are materialized, and finds a minimal schedule to run jobs. It then
submits stages to TaskScheduler.
In addition to coming up with the execution DAG, DAGScheduler also determines the
preferred locations to run each task on, based on the current cache status, and passes the
information to TaskScheduler.
Furthermore, it handles failures due to shuffle output files being lost, in which case old
stages may need to be resubmitted. Failures within a stage that are not caused by shuffle
file loss are handled by the TaskScheduler itself, which will retry each task a small number of
times before cancelling the whole stage.
DAGScheduler uses an internal event queue of DAGSchedulerEvents that it
reads and executes sequentially. See the section Internal Event Loop - dag-scheduler-event-
loop.
Turn on DEBUG or, for more detail, TRACE logging level to see what happens inside
DAGScheduler :
log4j.logger.org.apache.spark.scheduler.DAGScheduler=TRACE
DAGScheduler reports metrics about its execution (refer to the section Metrics).
Internal Registries
DAGScheduler maintains the following information in internal registries:
failedStages for stages that failed due to fetch failures (as reported by CompletionEvents
for FetchFailed end reasons) and that are waiting to be resubmitted.
cacheLocs is a mapping between RDD ids and their cache preferences per partition (as
arrays indexed by partition numbers). Each array value is the set of locations where that
RDD partition is cached. See Cache Tracking.
failedEpoch is a mapping between failed executors and the epoch number when the
failure was registered.
Caution FIXME Review
cleanupStateForJobAndIndependentStages
DAGScheduler.resubmitFailedStages
resubmitFailedStages() is called to go over the failedStages collection (of failed stages) and resubmit them.
If the failed stages collection contains any stage, the following INFO message appears in the
logs:
cacheLocs and failedStages are cleared, and the failed stages are submitted (using submitStage) one by one, ordered by their job ids.
DAGScheduler.runJob
When executed, DAGScheduler.runJob is given the following arguments:
A set of partitions to run on (not all partitions are always required to compute a job for
actions like first() or take() ).
It calls DAGScheduler.submitJob and then waits until a result comes using a JobWaiter
object. A job can succeed or fail.
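A sketch of calling runJob through its public entry point, SparkContext.runJob, on a subset of partitions only:
val rdd = sc.parallelize(1 to 100, 4)
// compute per-partition sums, but only for the first two partitions
val partialSums = sc.runJob(rdd, (it: Iterator[Int]) => it.sum, Seq(0, 1))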
DAGScheduler.submitJob
DAGScheduler.submitJob is called by SparkContext.submitJob and DAGScheduler.runJob.
It checks whether the set of partitions to run a function on is in the range of
available partitions of the RDD.
Figure 3. DAGScheduler.submitJob
You may see an exception thrown when the partitions in the set are outside the range:
A job listener is notified each time a task succeeds (by def taskSucceeded(index: Int,
result: Any) ), as well as if the whole job fails (by def jobFailed(exception: Exception) ).
In ActiveJob as a listener to notify if tasks in this job finish or the job fails.
In DAGScheduler.handleJobSubmitted
In DAGScheduler.handleMapStageSubmitted
In JobSubmitted
In MapStageSubmitted
JobWaiter waits until DAGScheduler completes the job and passes the results of tasks
to a resultHandler function.
ApproximateActionListener FIXME
JobWaiter
A JobWaiter is an extension of JobListener. It is used as the return value of
DAGScheduler.submitJob and DAGScheduler.submitMapStage . You can use a JobWaiter to
block until the job finishes executing or to cancel it.
While the methods execute, JobSubmitted and MapStageSubmitted events are posted that
reference the JobWaiter.
DAGScheduler.executorAdded
executorAdded(execId: String, host: String) method simply posts a ExecutorAdded event
to eventProcessLoop .
DAGScheduler.taskEnded
taskEnded(task: Task[_], reason: TaskEndReason, result: Any, accumUpdates: Map[Long, Any], ...) method simply posts a CompletionEvent
event to eventProcessLoop .
eventProcessLoop is DAGScheduler’s internal event
process loop to which Spark (by DAGScheduler.submitJob) posts jobs to schedule their
execution. Later on, TaskSetManager talks back to DAGScheduler to inform about the status
of the tasks using the same "communication channel".
It allows Spark to release the current thread when posting happens and let the event loop
handle events on a separate thread - asynchronously.
Caution FIXME image
The name of the single "logic" thread that reads events and takes decisions is dag-
scheduler-event-loop.
The following are the current types of DAGSchedulerEvent events that are handled by
DAGScheduler :
StageCancelled
JobCancelled
JobGroupCancelled
AllJobsCancelled
ExecutorLost
TaskSetFailed
ResubmitFailedStages
Caution FIXME statistics? MapOutputStatistics ?
Caution FIXME
It checks failedEpoch for the executor id (using execId ) and if it is found the following INFO
message appears in the logs:
false .
Figure 4. DAGScheduler.handleExecutorLost
Recurring ExecutorLost events merely lead to the following DEBUG message in the logs:
If however the executor is not in the list of lost executors or the failed epoch number is
smaller than the current one, the executor is added to failedEpoch.
BlockManagerMaster.removeExecutor(execId) is called.
If no external shuffle service is in use or the ExecutorLost event was for a map output fetch
operation, all ShuffleMapStages (using shuffleToMapStage ) are called (in order):
ShuffleMapStage.removeOutputsOnExecutor(execId)
MapOutputTrackerMaster.registerMapOutputs(shuffleId,
stage.outputLocInMapOutputTrackerFormat(), changeEpoch = true)
cacheLocs is cleared.
When a stage is cancelled, for each job associated with the stage, it calls handleJobCancellation(jobId, s"because
Stage [stageId] was cancelled") .
Note A stage knows what jobs it is part of using the internal set jobIds .
def handleJobCancellation(jobId: Int, reason: String = "") checks whether the job exists.
If the job exists, the job and all the stages that are only used by it are failed (using the
failJobAndIndependentStages method).
For each running stage associated with the job ( jobIdToStageIds ), if there is only one job
for the stage ( stageIdToStage ), TaskScheduler.cancelTasks is called,
outputCommitCoordinator.stageEnd(stage.id) is executed, and SparkListenerStageCompleted is posted for the
job being cancelled.
If no stage exists for stageId , the following INFO message shows in the logs:
A SparkListenerJobStart event is posted to Listener Bus (so other event listeners know
about the event - not only DAGScheduler).
handleMapStageSubmitted logs the INFO message Got map stage job %s (%s) with %d output
partitions using dependency.rdd.partitions.length while handleJobSubmitted
logs Got job %s (%s) with %d output partitions using partitions.length .
Tip FIXME: Could the above be cut to ActiveJob.numPartitions ?
handleMapStageSubmitted adds a new job with finalStage.addActiveJob(job)
while handleJobSubmitted sets with finalStage.setActiveJob(job) .
handleMapStageSubmitted checks if the final stage has already finished, tells the
listener and removes it using the code:
if (finalStage.isAvailable) {
markMapStageJobAsFinished(job, mapOutputTracker.getStatistics(dependency))
}
Figure 6. DAGScheduler.handleJobSubmitted
handleJobSubmitted has access to the final RDD, the partitions to compute, and the job listener.
It creates a new ResultStage (as finalStage in the figure) and instantiates ActiveJob .
INFO DAGScheduler: Got job [jobId] ([callSite.shortForm]) with [partitions.length] output partitions
INFO DAGScheduler: Final stage: [finalStage] ([finalStage.name])
INFO DAGScheduler: Parents of final stage: [finalStage.parents]
INFO DAGScheduler: Missing parents: [getMissingParentStages(finalStage)]
Then, the finalStage stage is given the ActiveJob instance and some housekeeping is
performed to track the job (using jobIdToActiveJob and activeJobs ).
When DAGScheduler executes a job it first submits the final stage (using submitStage).
The task knows about the stage it belongs to (using Task.stageId ), the partition it works on
(using Task.partitionId ), and the stage attempt (using Task.stageAttemptId ).
OutputCommitCoordinator.taskCompleted is called.
If the stage the task belongs to has been cancelled, stageIdToStage should not contain it,
and the method quits.
The main processing begins now depending on TaskEndReason - the reason for task
completion (using event.reason ). The method skips processing these TaskEndReasons :
TaskCommitDenied , ExceptionFailure , TaskResultLost , ExecutorLostFailure , TaskKilled ,
and UnknownReason .
TaskEndReason: Success
SparkListenerTaskEnd is posted on Listener Bus.
The partition the task worked on is removed from pendingPartitions of the stage.
ResultTask
For ResultTask, the stage is ResultStage . If there is no job active for the stage (using
resultStage.activeJob ), the following INFO message appears in the logs:
INFO Ignoring result from [task] because its job has finished
Otherwise, check whether the task is marked as running for the job (using job.finished )
and proceed. The method skips execution when the task has already been marked as
completed in the job.
Caution FIXME When could a task that has just finished be ignored, i.e. the job has
already been marked finished ? Could it be for stragglers?
updateAccumulators(event) is called.
The partition is marked as finished (using job.finished ) and the number of partitions
calculated increased (using job.numFinished ).
markStageAsFinished is called
cleanupStateForJobAndIndependentStages(job)
The JobListener of the job (using job.listener ) is informed about the task completion
(using job.listener.taskSucceeded(rt.outputId, event.result) ). If the step fails, i.e. throws
an exception, the JobListener is informed about it (using job.listener.jobFailed(new
SparkDriverExecutionException(e)) ).
ShuffleMapTask
updateAccumulators(event) is called.
event.result is MapStatus that knows the executor id where the task has finished (using
status.location.executorId ).
If failedEpoch contains the executor and the epoch of the ShuffleMapTask is not greater than
that in failedEpoch, you should see the following INFO message in the logs:
The method does more processing only if the internal runningStages contains the
ShuffleMapStage with no more pending partitions to compute (using
shuffleStage.pendingPartitions ).
markStageAsFinished(shuffleStage) is called.
cacheLocs is cleared.
If the map stage is ready, i.e. all partitions have shuffle outputs, map-stage jobs waiting on
this stage (using shuffleStage.mapStageJobs ) are marked as finished.
MapOutputTrackerMaster.getStatistics(shuffleStage.shuffleDep) is called and every map-
stage job is markMapStageJobAsFinished(job, stats) .
Otherwise, if the map stage is not ready, the following INFO message appears in the logs:
INFO Resubmitting [shuffleStage] ([shuffleStage.name]) because some of its tasks had failed: [missing
submitStage(shuffleStage) is called.
TaskEndReason: Resubmitted
For Resubmitted case, you should see the following INFO message in the logs:
The task (by task.partitionId ) is added to the collection of pending partitions of the stage
(using stage.pendingPartitions ).
Tip: A stage knows how many partitions are yet to be calculated. A task knows about
the partition id for which it was launched.
TaskEndReason: FetchFailed
FetchFailed(bmAddress, shuffleId, mapId, reduceId, failureMessage) comes with a
BlockManagerId (as bmAddress ) and the ids of the shuffle, map and reduce tasks involved.
When FetchFailed happens, stageIdToStage is used to access the failed stage (using
task.stageId ; the task is available in the event handled by handleTaskCompletion).
INFO Ignoring fetch failure from [task] as it's from [failedStage] attempt [task.stageAttemptId] and
If the failed stage is in runningStages , the following INFO message shows in the logs:
INFO Marking [failedStage] ([failedStage.name]) as failed due to a fetch failure from [mapStage] ([ma
If the failed stage is not in runningStages , the following DEBUG message shows in the logs:
DEBUG Received fetch failure from [task], but its from [failedStage] which is no longer running
If the number of fetch failed attempts for the stage exceeds the allowed number (using
Stage.failedOnFetchAndShouldAbort), the following method is called:
If there are no failed stages reported yet ( failedStages is empty), an INFO message shows in
the logs and resubmission of the failed stages is scheduled (after DAGScheduler.RESUBMIT_TIMEOUT ):
messageScheduler.schedule(
new Runnable {
override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
}, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
For all the cases, the failed stage and map stages are both added to failedStages set.
If mapId (in the FetchFailed object for the case) is provided, the map stage output is
cleaned up (as it is broken) using mapStage.removeOutputLoc(mapId, bmAddress) and
MapOutputTrackerMaster.unregisterMapOutput(shuffleId, mapId, bmAddress) methods.
DAGScheduler.submitWaitingStages method checks for waiting or failed stages that could now be eligible for submission.
The following TRACE messages show in the logs when the method is called:
The method clears the internal waitingStages set with stages that wait for their parent
stages to finish.
It goes over the waiting stages sorted by job ids in increasing order and calls submitStage
method.
Caution FIXME
There has to be an ActiveJob instance for the stage to proceed. Otherwise the stage and all
the dependent jobs are aborted (using abortStage ) with the message:
Job aborted due to stage failure: No active job for stage [stage.id]
For a stage with ActiveJob available, the following DEBUG message shows up in the logs:
Only when the stage is not in waiting ( waitingStages ), running ( runningStages ) or failed
states can this stage be processed.
A list of missing parent stages of the stage is calculated (see Calculating Missing Parent
Stages) and the following DEBUG message shows up in the logs:
When the stage has no parent stages missing, it is submitted and the INFO message shows
up in the logs:
If however there are missing parent stages for the stage, all stages are processed
recursively (using submitStage), and the stage is added to waitingStages set.
It starts with the stage’s target RDD (as stage.rdd ). If there are uncached partitions, it
traverses the dependencies of the RDD (as RDD.dependencies ) that can be the instances of
ShuffleDependency or NarrowDependency.
For each ShuffleDependency, the method searches for the corresponding ShuffleMapStage
(using getShuffleMapStage ) and if unavailable, the method adds it to a set of missing (map)
stages.
If TaskScheduler reports that a task failed because a map output file from a previous stage
was lost, the DAGScheduler resubmits that lost stage. This is detected through a
CompletionEvent with FetchFailed , or an ExecutorLost event. DAGScheduler will wait a
small amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for
any lost stage(s) that compute the missing tasks.
Please note that tasks from the old attempts of a stage could still be running.
A stage object tracks multiple StageInfo objects to pass to Spark listeners or the web UI.
The latest StageInfo for the most recent attempt for a stage is accessible through
latestInfo .
Cache Tracking
DAGScheduler tracks which RDDs are cached to avoid recomputing them and likewise
remembers which shuffle map stages have already produced output files to avoid redoing
the map side of a shuffle.
DAGScheduler is only interested in cache location coordinates, i.e. host and executor id, per
partition of an RDD.
If the storage level of an RDD is NONE, there is no caching and hence no partition cache
locations are available. In such cases, whenever asked, DAGScheduler returns a collection
with empty-location elements for each partition. The empty-location elements are to mark
uncached partitions.
Preferred Locations
DAGScheduler computes where to run each task in a stage based on the preferred locations
of its underlying RDDs, or the location of cached or shuffle data.
DAGScheduler.submitMissingTasks is called when the parent stages of the current stage have already finished and it is now possible to run
tasks for it.
The pendingPartitions internal field of the stage is cleared (it is later filled in with the partitions to compute).
The mapping between task ids and task preferred locations is computed (see
getPreferredLocs - Computing Preferred Locations for Tasks and Partitions).
SparkListenerStageSubmitted is posted.
The stage is serialized and broadcast to workers using SparkContext.broadcast method, i.e.
Serializer.serialize is used to calculate taskBinaryBytes - an array of bytes of (rdd, func) for
ResultStage and (rdd, shuffleDep) for ShuffleMapStage .
When serializing the stage fails, the stage is removed from the internal runningStages set,
abortStage is called and the method stops.
For each partition to compute for the stage, a collection of ShuffleMapTask for
ShuffleMapStage or ResultTask for ResultStage is created.
Caution FIXME Image with creating tasks for partitions in the stage.
If there are tasks to launch (there are missing partitions in the stage), the following INFO and
DEBUG messages are in the logs:
In case of no tasks to be submitted for a stage, a DEBUG message shows up in the logs.
For ShuffleMapStage:
For ResultStage:
Caution FIXME Review + why does the method return a sequence of TaskLocations?
Stopping
When a DAGScheduler stops (via stop() ), it stops the internal dag-scheduler-message
thread pool, dag-scheduler-event-loop, and TaskScheduler.
Metrics
Spark’s DAGScheduler uses Spark Metrics System (via DAGSchedulerSource ) to report
metrics about internal status.
Settings
spark.test.noStageRetry (default: false ) - if enabled, FetchFailed will not cause stage retries.
Jobs
A job (aka action job or active job) is a top-level work item (computation) submitted to
DAGScheduler to compute the result of an action.
A job starts with a single target RDD, but can ultimately include other RDDs that are all part
of the target RDD’s lineage graph.
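A sketch of how actions map to jobs (every action submits a separate job to DAGScheduler):
val rdd = sc.parallelize(1 to 10, 2).map(_ * 2)
rdd.count()   // job 0
rdd.collect() // job 1 - both jobs share the same target RDD and its lineage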
Caution FIXME Where are instances of ActiveJob used?
A job can be one of two logical types (that are only distinguished by an internal finalStage
field of ActiveJob ):
Map-stage job that computes the map output files for a ShuffleMapStage (for
submitMapStage ) before any downstream stages are submitted.
It is also used for adaptive query planning, to look at map output statistics before
submitting later stages.
Result job that computes a ResultStage to execute an action.
Jobs track how many partitions have already been computed (using finished array of
Boolean elements).
Stages
Introduction
A stage is a set of parallel tasks, one per partition of an RDD, that compute partial results of
a function executed as part of a Spark job.
A stage can only work on the partitions of a single RDD (identified by rdd ), but can be
associated with many other dependent parent stages (via internal field parents ), with the
boundary of a stage marked by shuffle dependencies.
Submitting a stage can therefore trigger execution of a series of dependent parent stages
(refer to RDDs, Job Execution, Stages, and Partitions).
Figure 2. Submitting a job triggers execution of the stage and its parent stages
Finally, every stage has a firstJobId that is the id of the job that submitted the stage.
ShuffleMapStage is an intermediate stage (in the execution DAG) that produces data for
other stage(s). It writes map output files for a shuffle. It can also be the final stage in a
job in adaptive query planning.
ResultStage is the final stage that executes a Spark action in a user program by running
a function on an RDD.
When a job is submitted, a new stage is created with the parent ShuffleMapStages linked —
they can be created from scratch or linked to, i.e. shared, if other jobs use them already.
DAGScheduler splits up a job into a collection of stages. Each stage contains a sequence of
narrow transformations that can be completed without shuffling the entire data set,
separated at shuffle boundaries, i.e. where shuffle occurs. Stages are thus a result of
breaking the RDD graph at shuffle boundaries.
In the end, every stage will have only shuffle dependencies on other stages, and may
compute multiple operations inside it. The actual pipelining of these operations happens in
the RDD.compute() functions of various RDDs, e.g. MappedRDD , FilteredRDD , etc.
At some point of time in a stage’s life, every partition of the stage gets transformed into a
task - ShuffleMapTask or ResultTask for ShuffleMapStage and ResultStage , respectively.
Partitions are computed in jobs, and result stages may not always need to compute all
partitions in their target RDD, e.g. for actions like first() and lookup() .
DAGScheduler prints the following INFO message when there are tasks to submit:
When no tasks in a stage can be submitted, the following DEBUG message shows in the
logs:
Caution FIXME Why do stages have numTasks ? Where is this used? How does this
correspond to the number of partitions in an RDD?
ShuffleMapStage
A ShuffleMapStage (aka shuffle map stage, or simply map stage) is an intermediate
stage in the execution DAG that produces data for shuffle operation. It is an input for the
other following stages in the DAG of stages. That is why it is also called a shuffle
dependency’s map side (see ShuffleDependency)
ShuffleMapStages usually contain multiple pipelined operations, e.g. map and filter ,
before shuffle operation.
When executed, ShuffleMapStages save map output files that can later be fetched by
reduce tasks.
The number of the partitions of an RDD is exactly the number of the tasks in a
ShuffleMapStage.
The output locations ( outputLocs ) of a ShuffleMapStage are the same as used by its
ShuffleDependency. Output locations can be missing, i.e. partitions have not been cached or
are lost.
ShuffleMapStages are registered to DAGScheduler that tracks the mapping of shuffles (by
their ids from SparkContext) to corresponding ShuffleMapStages that compute them, stored
in shuffleToMapStage .
getAncestorShuffleDependencies
cleanupStateForJobAndIndependentStages
handleExecutorLost
newShuffleMapStage - the proper way to create shuffle map stages (with the additional
setup steps)
MapStageSubmitted
FIXME
ResultStage
A ResultStage is the final stage in running any job that applies a function on some partitions
of the target RDD to compute the result of an action.
ShuffleMapStage Sharing
ShuffleMapStages can be shared across multiple jobs, if these jobs reuse the same RDDs.
1. Shuffle at sortByKey()
3. Intentionally repeat the last action that submits a new job with two stages with one being
shared as already-being-computed
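A sketch of the experiment described above (re-running an action so the second job reuses the already-computed shuffle map stage):
val rdd = sc.parallelize(0 to 9).map(n => (n % 3, n)).sortByKey()
rdd.count() // a job with a ShuffleMapStage and a ResultStage
rdd.count() // a new job; the ShuffleMapStage is shared and shows up as skipped in web UI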
Stage.findMissingPartitions
Stage.findMissingPartitions() calculates the ids of the missing partitions, i.e. partitions for
which the ActiveJob knows they are not finished (and so they are missing).
A ResultStage stage knows it by querying the active job about partition ids ( numPartitions )
that are not finished (using ActiveJob.finished array of booleans).
Stage.failedOnFetchAndShouldAbort
Stage.failedOnFetchAndShouldAbort(stageAttemptId: Int): Boolean checks whether the
Task Schedulers
A Task Scheduler schedules tasks for a single Spark application according to scheduling
mode (aka order task policy).
TaskScheduler Contract
submitTasks(taskSet: TaskSet)
executorHeartbeatReceived
Available Implementations
Spark comes with the following task schedulers:
TaskSchedulerImpl
Schedulable Contract
Schedulable is an interface for schedulable entities.
Is identified by name
Pool
Pool is a Schedulable. It requires a name, a scheduling mode, initial minShare and weight.
TaskContext
TaskContext is an abstract class that holds contextual information about a task.
attemptNumber is to denote how many times the task has been attempted (starting from
0).
addTaskCompletionListener
addTaskFailureListener
name which are associated with the instance which runs the task.
You can access the TaskContext instance for a running task using
org.apache.spark.TaskContext.get() .
import org.apache.spark.TaskContext
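// A sketch of reading the TaskContext inside a task (illustrative field accesses)
sc.parallelize(0 until 4, 2).foreach { _ =>
  val ctx = TaskContext.get()
  // runs on the executors; each task sees its own TaskContext
  println(s"stage=${ctx.stageId} partition=${ctx.partitionId} attempt=${ctx.attemptNumber}")
}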
TaskContextImpl
TaskContextImpl is the sole implementation of TaskContext abstract class.
Caution FIXME
stage
partition
task attempt
attempt number
runningLocally = false
TaskMemoryManager
Caution FIXME
TaskMetrics
Caution FIXME
Tasks
Task (aka command) is an individual unit of work for executors to run.
It is an individual unit of physical execution (computation) that runs on a single machine on a
part of the data of your Spark application. All tasks in a stage should be completed before
moving on to another stage.
A Task belongs to a single stage and operates on a single partition (a part of an RDD).
Tasks are spawned one by one for each stage and data partition. There are two kinds of tasks:
ShuffleMapTask that executes a task and divides the task’s output into multiple buckets
(based on the task’s partitioner). See ShuffleMapTask.
ResultTask that executes a task and sends the task’s output back to the driver
application.
The very last stage in a job consists of multiple ResultTasks , while earlier stages consist of
ShuffleMapTasks .
Task Execution
Consult TaskRunner.
Task States
A task can be in one of the following states:
LAUNCHING
RUNNING
FINISHED
FAILED
KILLED
LOST
ShuffleMapTask
A ShuffleMapTask divides the elements of an RDD into multiple buckets (based on a
partitioner specified in ShuffleDependency).
TaskSets
Introduction
A TaskSet is a collection of tasks that belong to a single stage and a stage attempt. It also
has priority and properties attributes. Priority is used in FIFO scheduling mode (see
Priority Field and FIFO Scheduling) while properties are the properties of the first job in the
stage.
The pair of a stage and a stage attempt uniquely describes a TaskSet and that is what you
can see in the logs when a TaskSet is used:
TaskSet [stageId].[stageAttemptId]
A TaskSet contains a fully-independent sequence of tasks that can run right away based on
the data that is already on the cluster, e.g. map output files from previous stages, though it
may fail if this data becomes unavailable.
removeRunningTask
Caution FIXME Review TaskSet.removeRunningTask(tid)
TaskSchedulerImpl.submitTasks
DAGScheduler.taskSetFailed
DAGScheduler.handleTaskSetFailed
TaskSchedulerImpl.createTaskSetManager
A TaskSet has priority field that turns into the priority field’s value of TaskSetManager
(which is a Schedulable).
The priority field is used in FIFOSchedulingAlgorithm, in which lower priorities (i.e. earlier jobs) are
scheduled first; for equal priorities, the lower stage id wins.
Effectively, the priority field is the job’s id of the first job this stage was part of (for FIFO
scheduling).
TaskSetManager
TaskSetManager manages execution of the tasks in a single TaskSet (after it has been
handed over by TaskScheduler).
TaskSetManager is Schedulable
TaskSetManager is a Schedulable with the following implementation:
weight is always 1 .
minShare is always 0 .
name is TaskSet_[taskSet.stageId.toString]
executorLost
checkSpeculatableTasks
itself.
TaskSetManager.checkSpeculatableTasks
checkSpeculatableTasks checks whether there are speculatable tasks in the TaskSet.
Note: checkSpeculatableTasks is called by TaskSchedulerImpl.checkSpeculatableTasks.
It then checks whether the number of tasks completed successfully (using tasksSuccessful ) is
equal to or greater than the minimum required for speculation ( spark.speculation.quantile
multiplied by the number of tasks).
Having done that, it computes the median duration of all the successfully completed tasks
(using taskInfos ) and the task length threshold as the median duration multiplied by
spark.speculation.multiplier (the threshold is never lower than 100 ms).
For each task (using taskInfos ) that is not marked as successful yet (using successful )
for which there is only one copy running (using copiesRunning ) and the task takes more
time than the calculated threshold, but it was not in speculatableTasks it is assumed
speculatable.
INFO Marking task [index] in stage [taskSet.id] (on [info.host]) as speculatable because it ran more
The task gets added to the internal speculatableTasks collection. The method responds
positively.
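A sketch of the speculation-related settings mentioned above (the values are examples or defaults):
val conf = new org.apache.spark.SparkConf()
  .set("spark.speculation", "true")            // enable speculative execution of tasks
  .set("spark.speculation.interval", "100ms")  // how often to check for speculatable tasks
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median a task must be
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish first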
TaskSetManager.executorLost
executorLost(execId: String, host: String, reason: ExecutorLossReason) is called when an executor is lost.
It first checks whether the TaskSet is for a ShuffleMapStage, i.e. all TaskSet.tasks are
instances of ShuffleMapTask. At the same time, it also checks whether an external shuffle
server is used (using env.blockManager.externalShuffleServiceEnabled ) that could serve the
shuffle outputs.
If the above checks pass, tasks (using taskInfos ) are checked for their executors and
whether the failed one is among them.
For each task that executed on the failed executor, if it was not already marked as successful
(using successful ):
the task gets added to the collection of pending tasks (using the addPendingTask method)
For the other cases, i.e. if the TaskSet is for ResultStage or an external shuffle service is
used, for all running tasks for the failed executor, handleFailedTask is called with the task
state being FAILED .
TaskSetManager.executorAdded
executorAdded simply calls recomputeLocality method.
TaskSetManager.recomputeLocality
recomputeLocality (re)computes locality levels as an indexed collection of task localities, i.e.
Array[TaskLocality.TaskLocality] .
It checks whether pendingTasksForExecutor has at least one element, and if so, it looks up
spark.locality.wait.* for PROCESS_LOCAL and checks whether there is an executor for which
TaskSchedulerImpl.isExecutorAlive is true . If the checks pass, PROCESS_LOCAL becomes
an element of the levels collection of task localities.
Then, the method checks pendingTasksWithNoPrefs and if it’s not empty, NO_PREF becomes
an element of the levels collection.
If pendingTasksForRack is not empty, and the wait time for RACK_LOCAL is defined, and there
is an executor for which TaskSchedulerImpl.hasHostAliveOnRack is true , RACK_LOCAL is
added to the levels collection.
Right before the method finishes, it prints out the following DEBUG to the logs:
TaskSetManager.resourceOffer
Caution FIXME Review TaskSetManager.resourceOffer . Does this have anything
related to the following section about scheduling tasks?
For every TaskSet submitted for execution, TaskSchedulerImpl creates a new instance of
TaskSetManager. It then calls SchedulerBackend.reviveOffers() (refer to submitTasks).
From TaskSetManager.resourceOffer :
INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.4, partition 0,PROCESS_LOCAL, 1
If a serialized task is bigger than 100 kB (it is not a configurable value), a WARN message
is printed out to the logs (only once per taskset):
WARN TaskSetManager: Stage [task.stageId] contains a task of very large size ([serializedTask.limit /
INFO TaskSetManager: Starting task [id] in stage [taskSet.id] (TID [taskId], [host], partition [task.
For example:
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 205
Caution FIXME
TaskSetManager requests the current epoch from MapOutputTracker and sets it on all tasks
in the taskset.
TaskSetManager keeps track of the tasks pending execution per executor, host, rack or with
no locality preferences.
TaskSetManager computes locality levels for the TaskSet for delay scheduling. While
computing you should see the following DEBUG in the logs:
Events
When a task has finished, TaskSetManager calls DAGScheduler.taskEnded.
Caution FIXME
TaskSetManager.handleSuccessfulTask
handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]) method marks the task (by
tid ) as successful and notifies the DAGScheduler that the task has ended.
It removes the task from runningTasksSet . It also decreases the number of running tasks in
the parent pool if it is defined (using parent and Pool.decreaseRunningTasks ).
If the task was not marked as successful already (using successful ), tasksSuccessful is
incremented and the following INFO message appears in the logs:
INFO Finished task [info.id] in stage [taskSet.id] (TID [info.taskId]) in [info.duration] ms on [info
It also marks the task as successful (using successful ). Finally, if the number of tasks
finished successfully is exactly the number of tasks the TaskSetManager manages, the
TaskSetManager turns zombie.
Otherwise, when the task was already marked as successful, the following INFO message
appears in the logs:
INFO Ignoring task-finished event for [info.id] in stage [taskSet.id] because task [index] has alread
failedExecutors.remove(index) is called.
At the end, the method checks whether the TaskSetManager is a zombie and no task is
running (using runningTasksSet ), and if so, it calls TaskSchedulerImpl.taskSetFinished.
TaskSetManager.handleFailedTask
handleFailedTask(tid: Long, state: TaskState, reason: TaskEndReason) method is called by
TaskSchedulerImpl or executorLost.
The method first checks whether the task has already been marked as failed (using
taskInfos) and if it has, it quits.
It removes the task from runningTasksSet and informs the parent pool to decrease its
running tasks.
It marks the TaskInfo as failed and grabs its index so the number of copies running of the
task is decremented (see copiesRunning).
The method calculates the failure exception to report per TaskEndReason . See below for the
possible cases of TaskEndReason.
If the TaskSetManager is not a zombie, and the task was not KILLED , and the task failure
should be counted towards the maximum number of times the task is allowed to fail before
the stage is aborted ( TaskFailedReason.countTowardsTaskFailures is true ), numFailures is
incremented and if the number of failures of the task equals or is greater than assigned to
the TaskSetManager ( maxTaskFailures ), the ERROR appears in the logs:
ERROR Task [id] in stage [id] failed [maxTaskFailures] times; aborting job
FetchFailed
For FetchFailed , it logs WARNING:
WARNING Lost task [id] in stage [id] (TID [id], [host]): [reason.toErrorString]
Unless it has already been marked as successful (in successful), the task becomes so and
tasksSuccessful is incremented.
No exception is returned.
ExceptionFailure
For ExceptionFailure , it grabs TaskMetrics if available.
ERROR Task [id] in stage [id] (TID [tid]) had a not serializable result: [exception.description]; not
If the description, i.e. the ExceptionFailure, has already been reported (and is therefore a
duplication), spark.logging.exceptionPrintInterval is checked before reprinting the duplicate
exception in full.
For full printout of the ExceptionFailure, the following WARNING appears in the logs:
WARNING Lost task [id] in stage [id] (TID [id], [host]): [reason.toErrorString]
INFO Lost task [id] in stage [id] (TID [id]) on executor [host]: [ef.className] ([ef.description]) [d
ExecutorLostFailure
For ExecutorLostFailure if not exitCausedByApp , the following INFO appears in the logs:
INFO Task [tid] failed because while it was being computed, its executor exited for a reason unrelate
Other TaskFailedReasons
For the other TaskFailedReasons, the WARNING appears in the logs:
WARNING Lost task [id] in stage [id] (TID [id], [host]): [reason.toErrorString]
Other TaskEndReason
For the other TaskEndReasons, the ERROR appears in the logs:
Caution FIXME
Up to spark.task.maxFailures attempts
In Spark shell with local master, spark.task.maxFailures is fixed to 1 and you need to use
local-with-retries master to change it to some other value.
In the following example, you are going to execute a job with two partitions and keep one
failing at all times (by throwing an exception). The aim is to learn the behavior of retrying
task execution in a stage in TaskSet. You will only look at a single task execution, namely
0.0 .
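A sketch of that experiment (the master URL uses the local-with-retries form local[threads, maxFailures]; the RDD and exception are made up):
$ ./bin/spark-shell --master "local[2, 4]"

scala> val rdd = sc.parallelize(0 to 9, 2)
scala> rdd.mapPartitionsWithIndex { (idx, it) =>
     |   if (idx == 0) throw new Exception("Partition 0 always fails") else it
     | }.count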
Zombie state
TaskSetManager enters zombie state when all tasks in a taskset have completed
successfully (regardless of the number of task attempts), or if the task set has been aborted
(see Aborting TaskSet).
While in zombie state, TaskSetManager can launch no new tasks and responds with no
TaskDescription to resourceOffers.
TaskSetManager remains in the zombie state until all tasks have finished running, i.e. to
continue to track and account for the running tasks.
Internal Registries
copiesRunning
successful
numFailures
taskAttempts
tasksSuccessful
weight (default: 1 )
minShare (default: 0 )
parent
totalResultSize
calculatedTasks
runningTasksSet
pendingTasksForExecutor
pendingTasksForHost
pendingTasksForRack
pendingTasksWithNoPrefs
allPendingTasks
speculatableTasks
recentExceptions
Settings
spark.scheduler.executorTaskBlacklistTime (default: 0L ) - time interval to pass after
which a task can be re-launched on the executor where it has once failed. It is to
prevent repeated task failures due to executor failures.
TaskSchedulerImpl
When your Spark application starts, i.e. at the time when an instance of SparkContext is
created, TaskSchedulerImpl with a SchedulerBackend and DAGScheduler are created and
started, too.
TaskSchedulerImpl.initialize
initialize(backend: SchedulerBackend) method is responsible for initializing a
TaskSchedulerImpl .
It has to be called after a TaskSchedulerImpl is created and before it can be started to know
about the SchedulerBackend to use.
Besides being assigned the instance of SchedulerBackend , it sets up rootPool (as part of
TaskScheduler Contract). It uses a scheduling mode (as configured by
spark.scheduler.mode) to build a SchedulableBuilder and call
SchedulableBuilder.buildPools() .
TaskSchedulerImpl.start
When TaskSchedulerImpl is started (using start() method), it starts the scheduler
backend it manages.
TaskSchedulerImpl.stop
When TaskSchedulerImpl is stopped (using stop() method), it does the following:
Shuts down the internal task-scheduler-speculation thread pool executor (used for
Speculative execution of tasks).
Stops SchedulerBackend.
Stops TaskResultGetter.
When enabled, you should see the following INFO message in the logs:
The job with speculatable tasks should finish while speculative tasks are running, and it will
leave these tasks running - no KILL command yet.
It uses checkSpeculatableTasks method that asks rootPool to check for speculatable tasks.
If there are any, SchedulerBackend is called for reviveOffers.
Caution FIXME How does Spark handle repeated results of speculative tasks since
there are copies launched?
Figure 3. TaskSchedulerImpl.submitTasks
Note: If there are tasks to launch for missing partitions in a stage, DAGScheduler
executes submitTasks (see submitMissingTasks for Stage and Job).
When this method is called, you should see the following INFO message in the logs:
It creates a new TaskSetManager for the given TaskSet taskSet and the acceptable
number of task failures (as maxTaskFailures ).
more than one active taskSet for stage [stage]: [TaskSet ids]
When the method is called the first time ( hasReceivedTask is false ) in cluster mode,
starvationTimer is scheduled at a fixed rate, i.e. every spark.starvation.timeout (starting
spark.starvation.timeout after the first task set is submitted).
Every time the starvation timer thread is executed, it checks whether hasLaunchedTask is
false , and logs the WARNING:
WARNING Initial job has not accepted any resources; check your cluster UI to ensure that workers are
TaskSchedulerImpl.resourceOffers
resourceOffers(offers: Seq[WorkerOffer]) is called by a scheduler backend, e.g. CoarseGrainedSchedulerBackend (for cluster modes) or LocalBackend (for local mode), to offer free resources available on the executors to run tasks on.
For each offer, the resourceOffers method tracks hosts per executor id (using
executorIdToHost ) and sets 0 as the number of tasks running on the executor if there is
no tasks running already (using executorIdToTaskCount ). It also tracks executor id per host.
Warning FIXME BUG? Why is executorAdded called for a new host added? Can't we have more executors on a host? The name of the method is misleading then.
Moreover, if a new host was added to the pool (using newExecAvail - FIXME when
exactly?), TaskSetManagers get informed about the new executor (using
TaskSetManager.executorAdded()).
Warning FIXME BUG? Why is the name newExecAvail since it's called for a new host added? Can't we have more executors on a host? The name of the method could be misleading.
resourceOffers also checks whether the number of cores in an offer is at least the number of cores needed for a task (using spark.task.cpus).
TaskResultGetter
TaskResultGetter is a helper class for TaskSchedulerImpl.statusUpdate. It asynchronously
fetches the task results of tasks that have finished successfully (using
enqueueSuccessfulTask) or fetches the reasons of failures for failed tasks (using
enqueueFailedTask). It then sends the "results" back to TaskSchedulerImpl .
Tip Consult Task States in Tasks to learn about the different task states.
enqueueSuccessfulTask
enqueueFailedTask
The methods use the internal (daemon thread) thread pool task-result-getter (as
getTaskResultExecutor ) with spark.resultGetter.threads so they can be executed
asynchronously.
TaskResultGetter.enqueueSuccessfulTask
enqueueSuccessfulTask(taskSetManager: TaskSetManager, tid: Long, serializedData: ByteBuffer) deserializes the task result from serializedData (using SparkEnv.closureSerializer ).
Caution FIXME Review taskSetManager.canFetchMoreResults(serializedData.limit()) .
sparkEnv.blockManager.getRemoteBytes(blockId) .
TaskResultGetter.enqueueFailedTask
Any ClassNotFoundException leads to printing out the ERROR message to the logs:
TaskSchedulerImpl.statusUpdate
statusUpdate(tid: Long, state: TaskState, serializedData: ByteBuffer) is called by
scheduler backends to inform about task state changes (see Task States in Tasks).
It is called by:
When statusUpdate starts, it checks the current state of the task and acts accordingly.
If a task became TaskState.LOST and there is still an executor assigned for the task (it
seems it may not given the check), the executor is marked as lost (or sometimes called
failed). The executor is later announced as such using DAGScheduler.executorLost with
SchedulerBackend.reviveOffers() being called afterwards.
The method looks up the TaskSetManager for the task (using taskIdToTaskSetManager ).
When the TaskSetManager is found and the task is in finished state, the task is removed
from the internal data structures, i.e. taskIdToTaskSetManager and taskIdToExecutorId , and
the number of currently running tasks for the executor(s) is decremented (using
executorIdToTaskCount ).
For a task in FAILED , KILLED , or LOST state, TaskSet.removeRunningTask is called (as for
the FINISHED state) and then TaskResultGetter.enqueueFailedTask.
If the TaskSetManager could not be found, the following ERROR shows in the logs:
ERROR Ignoring update with state [state] for TID [tid] because its task set is gone (this is likely t
TaskSchedulerImpl.handleFailedTask
TaskSchedulerImpl.handleFailedTask(taskSetManager: TaskSetManager, tid: Long, taskState: TaskState, reason: TaskEndReason)
It calls TaskSetManager.handleFailedTask.
If the TaskSetManager is not a zombie and the task’s state is not KILLED ,
SchedulerBackend.reviveOffers is called.
TaskSchedulerImpl.taskSetFinished
taskSetFinished(manager: TaskSetManager) method is called to inform TaskSchedulerImpl that all tasks in a TaskSetManager have finished and it should be removed. You should see the following INFO message in the logs:
INFO Removed TaskSet [manager.taskSet.id], whose tasks have all completed, from pool [manager.parent.
Scheduling Modes
The scheduling mode in TaskSchedulerImpl is configured by spark.scheduler.mode setting
that can be one of the following values:
FIFO - with no pools; one root pool with instances of TaskSetManager; a Schedulable with a lower priority gets scheduled sooner, or the earlier stage wins.
FAIR
See SchedulableBuilder.
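Just to give a taste of how the modes surface in the public API, here is a sketch of switching to FAIR scheduling and assigning jobs to a pool (the pool name production is a made-up example):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-scheduling-demo")
  .setMaster("local[*]")
  .set("spark.scheduler.mode", "FAIR")  // builds FAIR pools instead of the default FIFO

val sc = new SparkContext(conf)
// Jobs submitted from this thread are scheduled in the "production" pool.
sc.setLocalProperty("spark.scheduler.pool", "production")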
SchedulableBuilder
Caution FIXME
buildPools()
Note SchedulableBuilder.addTaskSetManager is called by TaskSchedulerImpl.submitTasks when a TaskSet is submitted for execution.
There are two implementations available (click the links to go to their sections):
FIFOSchedulableBuilder
FairSchedulableBuilder
FIFOSchedulableBuilder
FIFOSchedulableBuilder is a very basic SchedulableBuilder .
FairSchedulableBuilder
Caution FIXME
TaskSchedulerImpl.executorAdded
executorAdded(execId: String, host: String) method simply passes the notification along to DAGScheduler (using DAGScheduler.executorAdded).
Internal Registries
Settings
spark.task.maxFailures (default: 4 for cluster mode and 1 for local except local-
with-retries) - The number of individual task failures before giving up on the entire
TaskSet and the job afterwards.
spark.task.cpus (default: 1 ) - the number of CPU cores to allocate per task. You cannot have a different number of CPUs per task in a single SparkContext.
spark.scheduler.mode (default: FIFO ) - the scheduling mode. Refer to Scheduling Modes.
spark.speculation (default: false ) - enables speculative execution of tasks.
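For illustration only, the settings can be set on SparkConf before a SparkContext is created (the values are arbitrary examples):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")  // tolerate more task failures before aborting the TaskSet
  .set("spark.task.cpus", "2")         // every task claims 2 CPU cores of an executor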
Scheduler Backends
Introduction
Spark comes with a pluggable backend mechanism called scheduler backend (aka
backend scheduler) to support various cluster managers, e.g. Apache Mesos, Hadoop
YARN or Spark’s own Spark Standalone and Spark local.
These cluster managers differ by their custom task scheduling modes and resource offers
mechanisms, and Spark’s approach is to abstract the differences in SchedulerBackend
Contract.
Being a scheduler backend in Spark assumes an Apache Mesos-like model in which "an
application" gets resource offers as machines become available and can launch tasks on
them. Once a scheduler backend obtains the resource allocation, it can start executors.
SchedulerBackend Contract
Do reviveOffers
Answers isReady() to tell about its started/stopped state. It returns true by default.
reviveOffers
Note It is used in TaskSchedulerImpl using backend internal reference.
applicationAttemptId
applicationAttemptId(): Option[String] returns no application attempt id.
It is currently only supported by YARN cluster scheduler backend as the YARN cluster
manager supports multiple attempts.
getDriverLogUrls
getDriverLogUrls: Option[Map[String, String]] returns no URLs by default.
Available Implementations
Spark comes with the following scheduler backends:
CoarseGrainedSchedulerBackend
YarnSchedulerBackend
YarnClientSchedulerBackend
CoarseMesosSchedulerBackend
SimrSchedulerBackend
MesosSchedulerBackend
CoarseGrainedSchedulerBackend
CoarseGrainedSchedulerBackend is a scheduler backend that waits for coarse-grained
executors to connect to it (using RPC Environment) and launch tasks. It talks to a cluster
manager for resources for executors (register, remove).
This backend holds executors for the duration of the Spark job rather than relinquishing
executors whenever a task is done and asking the scheduler to launch a new executor for
each new task.
It tracks:
executors ( ExecutorData )
log4j.logger.org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend=DEBUG
Launching Tasks
If the serialized task's size is over the frame size, the task's TaskSetManager is aborted (using the TaskSetManager.abort method). TaskSchedulerImpl is queried for the TaskSetManager of the task.
Caution FIXME At that point, tasks have their executor assigned. When and how did that happen?
The number of free cores of the executor for the task is decremented (by spark.task.cpus)
and LaunchTask message is sent to it.
reviveOffers
It sends ReviveOffers message to driverEndpoint .
Stopping
stop() method stops CoarseGrainedScheduler RPC Endpoint and executors (using
stopExecutors() ).
isReady
isReady() starts by ensuring that sufficient resources are available to launch tasks using its own sufficientResourcesRegistered method. It gives custom implementations of itself a way to decide when enough resources have been registered.
In case there are no sufficient resources available yet (the above requirement does not
hold), it checks whether the time from the startup passed
spark.scheduler.maxRegisteredResourcesWaitingTime to give a way to submit tasks
(despite minRegisteredRatio not being reached yet).
It tracks:
Executor addresses (host and port) for executors ( addressToExecutorId ) - it is set when
an executor connects to register itself. See RegisterExecutor RPC message.
Total core count ( totalCoreCount ) - the sum of all cores on all executors. See RegisterExecutor RPC message.
RPC message.
RPC Messages
RemoveExecutor
RetrieveSparkProps
ReviveOffers
ReviveOffers calls makeOffers() that makes fake resource offers on all executors that are
alive.
Caution FIXME When is an executor alive? What other states can an executor be in?
StopDriver
StopExecutors
StopExecutors message is receive-reply and blocking. When received, the following INFO message appears in the logs:
RegisterExecutor
RegisterExecutor(executorId, executorRef, hostPort, cores, logUrls) is sent by CoarseGrainedExecutorBackend to register an executor with the driver.
When numPendingExecutors is more than 0 , the following is printed out to the logs:
Settings
spark.akka.frameSize (default: 128 and not greater than 2047m - 200k for extra data
in an Akka message) - largest frame size for Akka messages (serialized tasks or task
results) in MB.
spark.scheduler.minRegisteredResourcesRatio - a double value between 0 and 1 (inclusive) that controls the minimum ratio of (registered resources / total expected resources) before submitting tasks. See isReady.
Executor Backends
Executor Backend is a pluggable interface used by executors to send status updates to a
cluster scheduler.
It is effectively a bridge between the driver and an executor, i.e. there are two endpoints
running.
Status updates include information about tasks, i.e. id, state, and data (as ByteBuffer ).
At startup, an executor backend connects to the driver and creates an executor. It then
launches and kills tasks. It stops when the driver orders so.
CoarseGrainedExecutorBackend
MesosExecutorBackend
MesosExecutorBackend
Caution FIXME
CoarseGrainedExecutorBackend
CoarseGrainedExecutorBackend is an executor backend that manages a single coarse-grained executor, which lives for as long as the backend itself.
When Executor RPC Endpoint is started ( onStart ), it prints out INFO message to the logs:
All task status updates are sent along to driverRef as StatusUpdate messages.
log4j.logger.org.apache.spark.executor.CoarseGrainedExecutorBackend=INFO
Driver’s URL
The driver’s URL is of the format spark://[RpcEndpoint name]@[hostname]:[port] , e.g.
spark://CoarseGrainedScheduler@192.168.1.6:64859 .
main
CoarseGrainedExecutorBackend is a command-line application (it comes with main
method).
Unrecognized options or required options missing cause displaying usage help and exit.
$ ./bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend
Options are:
--driver-url <driverUrl>
--executor-id <executorId>
--hostname <hostname>
--cores <cores>
--app-id <appid>
--worker-url <workerUrl>
--user-class-path <url>
It sends a (blocking) RetrieveSparkProps message to the driver (using the value for
driverUrl command-line option). When the response (the driver’s SparkConf ) arrives it
adds spark.app.id (using the value for appid command-line option) and creates a brand
new SparkConf .
Caution FIXME
Usage
It is used in:
SparkDeploySchedulerBackend
CoarseMesosSchedulerBackend
SparkClassCommandBuilder - ???
ExecutorRunnable
RPC Messages
RegisteredExecutor(hostname)
RegisteredExecutor(hostname) is received to confirm successful registration to a driver. This is when the backend creates its Executor.
RegisterExecutorFailed(message)
RegisterExecutorFailed(message) is to inform that registration to a driver failed. It exits the CoarseGrainedExecutorBackend process.
LaunchTask(data)
LaunchTask(data) checks whether an executor has been created and prints the following
ERROR if not:
KillTask(taskId, _, interruptThread)
KillTask(taskId, _, interruptThread) message kills a task (calls Executor.killTask ).
If an executor has not been initialized yet (FIXME: why?), the following ERROR message is
printed out to the logs and CoarseGrainedExecutorBackend exits:
StopExecutor
StopExecutor message handler is receive-reply and blocking. When received, the handler
Shutdown
Shutdown stops the executor, itself and RPC Environment.
Shuffle Manager
Spark comes with a pluggable mechanism for shuffle systems.
Shuffle Manager (aka Shuffle Service) is a Spark service that tracks shuffle dependencies
for ShuffleMapStage. The driver and executors all have their own Shuffle Service.
The driver registers shuffles with a shuffle manager, and executors (or tasks running locally
in the driver) can ask to read and write data.
There can be many shuffle services running simultaneously and a driver registers with all of
them when CoarseGrainedSchedulerBackend is used.
The name appears here, twice in the build’s output and others.
Get the gist of "The shuffle files are not currently cleaned up when using Spark on
Mesos with the external shuffle service"
ShuffleManager Contract
Available Implementations
Spark comes with two implementations of ShuffleManager contract:
SortShuffleManager
SortShuffleManager is a shuffle manager with the short name being sort .
You can start an external shuffle service instance using the following command:
./bin/spark-class org.apache.spark.deploy.ExternalShuffleService
When the server starts, you should see the following INFO message in the logs:
INFO ExternalShuffleService: Starting shuffle service on port [port] with useSasl = [useSasl]
log4j.logger.org.apache.spark.network.shuffle.ExternalShuffleBlockHandler=TRACE
log4j.logger.org.apache.spark.network.shuffle.ExternalShuffleBlockResolver=DEBUG
You should also see the following messages when a SparkContext is closed:
Settings
spark.shuffle.manager (default: sort ) sets up the default shuffle manager by a short name or the fully-qualified class name of the implementation.
spark.shuffle.service.enabled (default: false ) controls whether an external shuffle service should be used. When enabled, the Spark driver will register with the shuffle service. See External Shuffle Service.
WARN SortShuffleManager: spark.shuffle.spill was set to false, but this configuration is ignored
Block Manager
Block Manager is a key-value store for blocks that acts as a cache. It runs on every node,
i.e. a driver and executors, in a Spark runtime environment. It provides interfaces for putting
and retrieving blocks both locally and remotely into various stores, i.e. memory, disk, and off-
heap.
A BlockManager manages the storage for most of the data in Spark, i.e. blocks that represent a cached RDD partition, intermediate shuffle data, broadcast data, etc. See BlockId.
A BlockManager works with the following services:
RpcEnv
BlockManagerMaster
Serializer
MemoryManager
MapOutputTracker
ShuffleManager
BlockTransferService
SecurityManager
BlockManager is a BlockDataManager.
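As a quick illustration of blocks reaching the Block Manager, caching an RDD stores its partitions as blocks (a sketch using the public RDD API; the input path is a placeholder):

import org.apache.spark.storage.StorageLevel

// sc is an existing SparkContext, e.g. in spark-shell
val lines = sc.textFile("README.md")
lines.persist(StorageLevel.MEMORY_AND_DISK)  // partitions become blocks in the memory store, spilled to disk if needed
lines.count()                                // materializes the RDD and puts the blocks in the stores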
Tip You may want to shut off WARN messages being printed out about the current state of blocks using the following line:
log4j.logger.org.apache.spark.storage.BlockManager=OFF
BlockManager.initialize
initialize(appId: String) method is called to initialize a BlockManager instance.
Note The method must be called before a BlockManager can be fully operable.
It initializes BlockTransferService
It creates an instance of BlockManagerId given an executor id, host name and port for
BlockTransferService.
It creates the address of the server that serves this executor’s shuffle files (using
shuffleServerId )
If an external shuffle service is used, the following INFO appears in the logs:
At the end, if an external shuffle service is used, and it is not a driver, it registers to the
external shuffle service.
While registering to the external shuffle service, you should see the following INFO message
in the logs:
Any issues while connecting to the external shuffle service are reported as ERROR
messages in the logs:
ERROR Failed to connect to external shuffle server, will retry [attempts] more times after waiting [S
BlockManagerSlaveEndpoint
BlockManagerSlaveEndpoint is a RPC endpoint for remote communication between workers
When a BlockManager is created, it sets up the RPC endpoint with the name
BlockManagerEndpoint[randomId] and BlockManagerSlaveEndpoint as the class to handle
RPC messages.
RPC Messages
log4j.logger.org.apache.spark.storage.BlockManagerSlaveEndpoint=DEBUG
BlockManager.removeBroadcast .
BlockManager.getStatus ).
BlockManager.getMatchingBlockIds ).
BlockTransferService
Caution FIXME
ExternalShuffleClient
Caution FIXME
BlockId
BlockId identifies a block of data. It has a globally unique identifier ( name )
Broadcast Values
When a new broadcast value is created, TorrentBroadcast - the default implementation of
Broadcast - blocks are put in the block manager. See TorrentBroadcast.
TRACE Put for block [blockId] took [startTimeMs] to get into synchronized block
It puts the data in memory first and drops it to disk if the memory store can't hold it.
Stores
A Store is the place where blocks are held.
BlockManagerMaster
Caution FIXME
BlockManagerMaster is the Block Manager that runs on the driver only. It registers itself as
BlockManagerMaster endpoint in RPC Environment.
BlockManagerMaster.registerBlockManager
Caution FIXME
BlockManagerMasterEndpoint
Caution FIXME
RegisterBlockManager
UpdateBlockInfo
GetLocations
GetLocationsMultipleBlockIds
GetPeers
GetRpcHostPortForExecutor
GetMemoryStatus
GetStorageStatus
GetBlockStatus
GetMatchingBlockIds
RemoveRdd
RemoveShuffle
RemoveBroadcast
RemoveBlock
RemoveExecutor
StopBlockManagerMaster
BlockManagerHeartbeat
HasCachedBlocks
BlockManagerId
FIXME
DiskBlockManager
DiskBlockManager creates and maintains the logical mapping between logical blocks and
physical on-disk locations.
By default, one block is mapped to one file with a name given by its BlockId. It is however
possible to have a block map to only a segment of a file.
Block files are hashed among the directories listed in spark.local.dir (or in
SPARK_LOCAL_DIRS if set).
Execution Context
block-manager-future is the execution context for…FIXME
Metrics
Block Manager uses Spark Metrics System (via BlockManagerSource ) to report metrics about
internal status.
Misc
The underlying abstraction for blocks in Spark is a ByteBuffer that limits the size of a block
to 2GB ( Integer.MAX_VALUE - see Why does FileChannel.map take up to
Integer.MAX_VALUE of data? and SPARK-1476 2GB limit in spark for blocks). This has
implication not just for managed blocks in use, but also for shuffle blocks (memory mapped
blocks are limited to 2GB, even though the API allows for long ), ser-deser via byte array-
backed output streams.
When a non-local executor starts, it initializes a Block Manager object for spark.app.id id.
If a task result is bigger than Akka’s message frame size - spark.akka.frameSize - executors
use the block manager to send the result back. Task results are configured using
spark.driver.maxResultSize (default: 1g ).
Settings
spark.shuffle.service.enabled (default: false ) - whether an external shuffle service is enabled.
spark.broadcast.compress (default: true ) - whether to compress broadcast variables.
spark.rdd.compress (default: false ) - whether to compress RDD partitions that are stored serialized.
HTTP File Server
Settings
spark.fileserver.port (default: 0 ) - the port of a file server
Broadcast Manager
Broadcast Manager is a Spark service to manage broadcast values in Spark jobs. It is
created for a Spark application as part of SparkContext’s initialization and is a simple
wrapper around BroadcastFactory.
Broadcast Manager tracks the number of broadcast values (using the internal field
nextBroadcastId ).
The idea is to transfer values used in transformations from a driver to executors in the most effective way so they are copied once and used many times by tasks (rather than being copied every time a task is launched).
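A minimal sketch of the public broadcast API (the lookup table is a made-up example):

// sc is an existing SparkContext, e.g. in spark-shell
val lookup = sc.broadcast(Map("en" -> "English", "pl" -> "Polish"))
val codes = sc.parallelize(Seq("en", "pl", "en"))
// Tasks read the broadcast value; it is shipped to each executor once, not once per task.
val names = codes.map(code => lookup.value.getOrElse(code, "unknown")).collect()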
BroadcastFactory
BroadcastFactory is a pluggable interface for broadcast implementations in Spark. It is used by Broadcast Manager to create broadcast values.
TorrentBroadcast
The default Broadcast implementation in Spark is org.apache.spark.broadcast.TorrentBroadcast (see spark.broadcast.factory). It uses a BitTorrent-like protocol to distribute blocks of the broadcast data among executors.
Compression
When spark.broadcast.compress is true (default), compression is used. The compression codec is configured with spark.io.compression.codec, e.g. lz4 for org.apache.spark.io.LZ4CompressionCodec , one of the codecs available.
Settings
spark.broadcast.factory (default: org.apache.spark.broadcast.TorrentBroadcastFactory ) - the fully-qualified class name of the BroadcastFactory implementation to use.
spark.broadcast.compress (default: true ) - whether to compress broadcast values. See Compression.
Dynamic Allocation
Tip See the excellent slide deck Dynamic Allocation in Spark from Databricks.
Available since Spark 1.2.0 with many fixes and extensions up to 1.5.0.
Support was first introduced in YARN in 1.2, and then extended to Mesos coarse-
grained mode. It is supported in Standalone mode, too.
In dynamic allocation you get as much as needed and no more. It allows Spark to scale the number of executors up and down based on workload, i.e. idle executors are removed, and if you need more executors for pending tasks, you request them.
The number of executors requested grows exponentially (slow start), since the application may initially need only slightly more.
Metrics
Dynamic Allocation feature uses Spark Metrics System (via
ExecutorAllocationManagerSource ) to report metrics about internal status.
executors / numberExecutorsToAdd
executors / numberExecutorsPendingToRemove
executors / numberAllExecutors
executors / numberTargetExecutors
executors / numberMaxNeededExecutors
Settings
spark.dynamicAllocation.enabled (default: false ) - whether dynamic allocation is
enabled for the given Spark context. It requires that spark.executor.instances (default:
0 ) is 0 .
executors
IDs.
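A minimal sketch of enabling the feature through SparkConf, assuming an external shuffle service runs on the nodes (the executor bounds are arbitrary examples):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")   // shuffle files must outlive removed executors
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")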
Future
SPARK-4922
SPARK-4751
SPARK-7955
Data Locality
With HDFS, the Spark driver contacts NameNode about the DataNodes (ideally local) containing the various blocks of a file or directory as well as their locations (represented as InputSplits ), and then schedules the work to the SparkWorkers.
Spark tries to execute tasks as close to the data as possible to minimize data transfer (over
the wire).
PROCESS_LOCAL
NODE_LOCAL
NO_PREF
RACK_LOCAL
ANY
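Spark waits a configurable time at a more local level before falling back to a less local one. A sketch of tuning the waits (the values are examples, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "3s")       // how long to wait before giving up on a locality level
  .set("spark.locality.wait.node", "6s")  // per-level override, here for NODE_LOCAL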
Cache Manager
Cache Manager in Spark is responsible for passing RDD partition contents to Block Manager and making sure a node doesn't load two copies of an RDD at once.
Spark, Akka and Netty
Spark uses Akka Actor for RPC and messaging, which in turn uses Netty.
For shuffle data, Netty can be optionally used. By default, NIO is used directly to transfer shuffle data.
sparkMaster is the name of Actor System for the master in Spark Standalone, i.e.
spark.akka.threads (default: 4 )
spark.akka.batchSize (default: 15 )
spark.akka.frameSize (default: 128 ) configures max frame size for Akka messages in
bytes
OutputCommitCoordinator
From the scaladoc (it’s a private[spark] class so no way to find it outside the code):
Authority that decides whether tasks can commit output to HDFS. Uses a "first
committer wins" policy. OutputCommitCoordinator is instantiated in both the drivers and
executors. On executors, it is configured with a reference to the driver’s
OutputCommitCoordinatorEndpoint, so requests to commit output will be forwarded to
the driver’s OutputCommitCoordinator.
This class was introduced in SPARK-4879; see that JIRA issue (and the associated pull
requests) for an extensive design discussion.
RPC Environment (RpcEnv)
RPC Environment (aka RpcEnv) is an environment for RpcEndpoints to process messages. A RpcEnv manages the lifecycle of RpcEndpoints: it registers them, routes incoming messages to them, and stops them.
A RPC Environment is defined by the name, host, and port. It can also be controlled by a
security manager.
RpcEndpointRefs can be looked up by name or uri (because different RpcEnvs may have
different naming schemes).
RpcEnvFactory
Spark comes with ( private[spark] trait ) RpcEnvFactory which is the factory contract to
create a RPC Environment.
You can choose an RPC implementation to use by spark.rpc (default: netty ). The setting
can be one of the two short names for the known RpcEnvFactories - netty or akka - or a
fully-qualified class name of your custom factory (including Netty-based and Akka-based
implementations).
RpcEndpoint
RpcEndpoints define how to handle messages (what functions to execute given a
message). RpcEndpoints live inside RpcEnv after being registered by a name.
ThreadSafeRpcEndpoint .
RpcEndpointRef
A RpcEndpointRef is a reference to a RpcEndpoint in a RpcEnv. It is a serializable entity and so you can send it over a network or save it for later use (it can however be deserialized using the owning RpcEnv only).
You can send asynchronous one-way messages to the corresponding RpcEndpoint using
send method.
You can send a semi-synchronous message, i.e. "subscribe" to be notified when a response
arrives, using ask method. You can also block the current calling thread for a response
using askWithRetry method.
RpcAddress
RpcAddress is the logical address for an RPC Environment, with hostname and port.
RpcEndpointAddress
RpcEndpointAddress is the logical address for an endpoint registered to an RPC
Environment, with RpcAddress and name.
The endpoint lookup timeout is controlled by a prioritized list of properties (the higher on the list, the more important):
spark.rpc.lookupTimeout
spark.network.timeout
Their value can be a number alone (seconds) or any number with time suffix, e.g. 50s ,
100ms , or 250us . See Settings.
You can control the time to wait for a response using the following settings (in that order):
spark.rpc.askTimeout
spark.network.timeout
Their value can be a number alone (seconds) or any number with time suffix, e.g. 50s ,
100ms , or 250us . See Settings.
Exceptions
When RpcEnv catches uncaught exceptions, it uses RpcCallContext.sendFailure to send exceptions back to the sender, or logs them if there is no such sender or the exception is a NotSerializableException .
If any error is thrown from one of RpcEndpoint methods except onError , onError will be
invoked with the cause. If onError throws an error, RpcEnv will ignore it.
RpcEnvConfig
RpcEnvConfig is a placeholder for an instance of SparkConf, the name of the RPC
Environment, host and port, a security manager, and clientMode.
RpcEnv.create
You can create a RPC Environment using the helper method RpcEnv.create .
It assumes that a RpcEnvFactory with a no-argument constructor is available under the spark.rpc setting so that it can be created via reflection.
Settings
spark.rpc (default: netty since Spark 1.6.0-SNAPSHOT) - the RPC implementation to use (see RpcEnvFactory).
spark.rpc.lookupTimeout (default: 120s ) - the default timeout to use for RPC remote endpoint lookups.
spark.network.timeout (default: 120s ) - the default network timeout, used as a fallback for the RPC lookup and ask timeouts.
spark.rpc.askTimeout (default: 120s ) - the default timeout to use for RPC ask operations.
Others
The Worker class calls startRpcEnvAndEndpoint with the following configuration options:
host
port
webUiPort
cores
memory
masters
workDir
Netty-based RpcEnv
Tip Read RPC Environment (RpcEnv) about the concept of RPC Environment in Spark.
When NettyRpcEnv starts, the following INFO message is printed out in the logs:
Client Mode
Refer to Client Mode = is this an executor or the driver? for introduction about client mode.
This is only for Netty-based RpcEnv.
When created, a Netty-based RpcEnv starts the RPC server and register necessary
endpoints for non-client mode, i.e. when client mode is false .
It means that the required services for remote communication with NettyRpcEnv are only
started on the driver (not executors).
Thread Pools
shuffle-server-ID
EventLoopGroup uses a daemon thread pool called shuffle-server-ID , where ID is a unique integer identifier.
dispatcher-event-loop-ID
NettyRpcEnv’s Dispatcher uses the daemon fixed thread pool with
spark.rpc.netty.dispatcher.numThreads threads.
netty-rpc-env-timeout
NettyRpcEnv uses the daemon single-thread scheduled thread pool netty-rpc-env-timeout .
netty-rpc-connection-ID
NettyRpcEnv uses the daemon cached thread pool with up to spark.rpc.connect.threads
threads.
Settings
spark.rpc.io.mode (default: NIO ) - NIO or EPOLL for low-level IO. NIO is always available, while EPOLL is only available on Linux. NIO uses io.netty.channel.nio.NioEventLoopGroup while EPOLL uses io.netty.channel.epoll.EpollEventLoopGroup .
spark.rpc.netty.dispatcher.numThreads (default: the number of processors available to the JVM) - the number of threads used by the Dispatcher.
spark.port.maxRetries (default: 16 ) controls the maximum number of binding attempts/retries to a port before giving up.
Endpoints
endpoint-verifier ( RpcEndpointVerifier ) - a RpcEndpoint for remote RpcEnvs to
query whether an RpcEndpoint exists or not. It uses Dispatcher that keeps track of
registered endpoints and responds true / false to CheckExistence message.
endpoint-verifier is used to check out whether a given endpoint exists or not before the endpoint's reference is handed out to clients.
One use case is when an AppClient connects to standalone Masters before it registers the
application it acts for.
Message Dispatcher
A message dispatcher is responsible for routing RPC messages to the appropriate
endpoint(s).
ContextCleaner
ContextCleaner is a Spark service that does cleanup of shuffles, RDDs and broadcasts that are no longer needed.
It uses a daemon Spark Context Cleaner thread that cleans RDD, shuffle, and broadcast
states (using keepCleaning method).
Settings
spark.cleaner.referenceTracking (default: true ) controls whether to enable or not the ContextCleaner.
spark.cleaner.referenceTracking.blocking (default: true ) controls whether the cleaning thread will block on cleanup tasks (other than shuffle, which is controlled by the spark.cleaner.referenceTracking.blocking.shuffle parameter).
MapOutputTracker
A MapOutputTracker is a Spark service to track the locations of the (shuffle) map outputs of
a stage. It uses an internal MapStatus map with an array of MapStatus for every partition for
a shuffle id.
It works with ShuffledRDD when it asks for preferred locations for a shuffle using
tracker.getPreferredLocationsForShuffle .
MapStatus
A MapStatus is the result returned by a ShuffleMapTask to DAGScheduler that includes:
the location where the ShuffleMapTask ran (as def location: BlockManagerId ).
an estimated size for the reduce block, in bytes (as def getSizeForBlock(reduceId: Int): Long ).
When the number of blocks (the size of uncompressedSizes ) is greater than 2000,
HighlyCompressedMapStatus is chosen.
Caution FIXME What exactly is 2000? Is this the number of tasks in a job?
Epoch Number
Caution FIXME
MapOutputTrackerMaster
A MapOutputTrackerMaster is the MapOutputTracker for a driver.
mapStatuses .
Note There is currently a hardcoded limit of map and reduce tasks above which Spark does not assign preferred locations (aka locality preferences) based on map output sizes: 1000 for map and reduce each.
You should see the following INFO message when the MapOutputTrackerMaster is created
(FIXME it uses MapOutputTrackerMasterEndpoint ):
MapOutputTrackerMaster.registerShuffle
Caution FIXME
MapOutputTrackerMaster.getStatistics
Caution FIXME
MapOutputTrackerMaster.unregisterMapOutput
Caution FIXME
MapOutputTrackerMaster.registerMapOutputs
Caution FIXME
MapOutputTrackerMaster.incrementEpoch
Caution FIXME
You should see the following DEBUG message in the logs for entries being removed:
MapOutputTrackerMaster.getEpoch
Caution FIXME
Settings
spark.shuffle.reduceLocality.enabled (default: true) - whether to compute locality preferences for reduce tasks.
MapOutputTrackerWorker
A MapOutputTrackerWorker is the MapOutputTracker for executors. The internal
mapStatuses map serves as a cache and any miss triggers a fetch from the driver’s
MapOutputTrackerMaster.
ExecutorAllocationManager
FIXME
Deployment Environments
Spark Deployment Environments:
local
cluster
Spark Standalone
Spark on Mesos
Spark on YARN
A Spark application can run locally (on a single JVM) or on the cluster which uses a cluster
manager and the deploy mode ( --deploy-mode ). See spark-submit script.
Master URLs
Spark supports the following master URLs (see private object SparkMasterRegex):
local , local[N] and local[*] for Spark local
local[N, maxRetries] for Spark local-with-retries
local-cluster[N, cores, memory] for simulating a Spark cluster of [N, cores, memory] locally
spark://host:port for connecting to a Spark Standalone cluster
mesos://host:port for connecting to a Mesos cluster
yarn-client and yarn-cluster for Spark on YARN
You use a master URL with spark-submit as the value of --master command-line option or
when creating SparkContext using setMaster method.
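For illustration, a sketch of the programmatic way (the master URL and application name are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("master-url-demo")
  .setMaster("spark://localhost:7077")  // or "local[*]", "mesos://host:5050", "yarn-client", ...

val sc = new SparkContext(conf)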
Spark local
You can run Spark in local mode. In this non-distributed single-JVM deployment mode,
Spark spawns all the execution components - driver, executor, backend, and master - in the
same JVM. The default parallelism is the number of threads as specified in the master URL.
This is the only mode where the driver itself is used for task execution.
This mode of operation is also called Spark in-process or (less commonly) a local version
of Spark.
scala> sc.isLocal
res0: Boolean = true
Spark shell defaults to local mode with local[*] as the master URL.
scala> sc.master
res0: String = local[*]
Tasks are not re-executed on failure in local mode (unless local-with-retries master URL is
used).
The task scheduler in local mode works with LocalBackend task scheduler backend.
Master URL
You can run Spark in local mode using local , local[n] or the most general local[*] for
the master URL.
local[*] uses as many threads as the number of processors available to the Java Virtual Machine (JVM).
Caution FIXME What happens when there are fewer cores than n in the master URL? It is a question from Twitter.
If there are one or more tasks that match the offer, they are launched (using the executor.launchTask method).
LocalBackend
LocalBackend is a scheduler backend and an executor backend for Spark local mode.
It acts as a "cluster manager" for local mode to offer resources on the single worker it
manages, i.e. it calls TaskSchedulerImpl.resourceOffers(offers) with offers being a single-
element collection with WorkerOffer("driver", "localhost", freeCores) .
Caution FIXME Review freeCores . It appears you could have many jobs running simultaneously.
LocalEndpoint
LocalEndpoint is the communication channel between Task Scheduler and LocalBackend.
It is a (thread-safe) RPC Endpoint that hosts an executor (with id driver and hostname
localhost ) for Spark local mode.
When a LocalEndpoint starts up (as part of Spark local’s initialization) it prints out the
following INFO messages to the logs:
RPC Messages
LocalEndpoint accepts the following RPC message types:
StatusUpdate (receive-only, non-blocking) that passes the task status update to the Task Scheduler (using statusUpdate ) and, if the task's status is finished, revives offers (see ReviveOffers ).
KillTask (receive-only, non-blocking) that kills the task that is currently running on the
executor.
Settings
spark.default.parallelism (default: the number of threads as specified in master URL)
Spark on cluster
Spark can run on clusters managed by one of the following cluster managers:
Spark Standalone
Hadoop YARN
Apache Mesos
Spark driver communicates with a cluster manager for resources, e.g. CPU, memory, disk.
The cluster manager spawns Spark executors in the cluster.
Caution FIXME Spark execution in cluster - a diagram of the communication between driver, cluster manager, and workers with executors and tasks. See Cluster Mode Overview. Show Spark's driver with the main code in Scala in the box; nodes with executors with tasks; hosts drivers; manages a cluster.
The workers are in charge of communicating to the cluster manager the availability of their resources.
Communication with a driver is through a RPC interface (at the moment Akka), except
Mesos in fine-grained mode.
Executors remain alive after jobs are finished for future ones. This allows for better data
utilization as intermediate data is cached in memory.
fine-grained partitioning
low-latency scheduling
Reusing also means the resources can be held onto for a long time.
Spark reuses long-running executors for speed (contrary to Hadoop MapReduce using
short-lived containers for each task).
The Spark driver is launched to invoke the main method of the Spark application.
The driver asks the cluster manager for resources to run the application, i.e. to launch
executors that run tasks.
The driver runs the Spark application and sends tasks to the executors.
Right after SparkContext.stop() is executed from the driver or the main method has
exited all the executors are terminated and the cluster resources are released by the
cluster manager.
"There’s not a good reason to run more than one worker per machine." by Sean
Note Owen in What is the relationship between workers, worker instances, and
executors?
Caution One executor per node may not always be ideal, especially when your nodes have lots of RAM. On the other hand, using fewer executors has benefits like more efficient broadcasts.
Others
A Spark application can be split into the part written in Scala, Java, or Python and the cluster itself in which the application is going to run.
A Spark application consists of a single driver process and a set of executor processes
scattered across nodes on the cluster.
Both the driver and the executors usually run as long as the application. The concept of
dynamic resource allocation has changed it.
A node is a machine, and there’s not a good reason to run more than one worker per
machine. So two worker nodes typically means two machines, each a Spark worker.
Workers hold many executors for many applications. One application has executors on
many workers.
Spark Driver
A separate Java process running on its own JVM
Launches tasks
Submit modes
There are two submit modes, i.e. where a driver runs: client deploy mode, where the driver runs on the machine the application was submitted from, and cluster deploy mode, where the driver runs on a node inside the cluster.
Standalone Master (often written standalone Master) is the cluster manager for Spark
Standalone cluster. See Master.
Standalone workers (aka standalone slaves) are the workers in Spark Standalone cluster.
They can be started and stopped using custom management scripts for standalone Workers.
Spark Standalone cluster is one of the three available clustering options in Spark (refer to
Running Spark on cluster).
Standalone cluster mode is subject to the constraint that only one executor can be allocated
on each worker per application.
Once a Spark Standalone cluster has been started, you can access it using spark://
master URL (refer to Master URLs).
You can deploy, i.e. spark-submit , your applications to Spark Standalone in client or
cluster deploy mode. Refer to Deployment modes
Deployment modes
Caution FIXME
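As a sketch only (the class name, jar path and master URL are placeholders), submitting an application to a standalone cluster in client deploy mode could look as follows:

$ ./bin/spark-submit \
  --master spark://localhost:7077 \
  --deploy-mode client \
  --class pl.japila.spark.MyApp \
  /path/to/my-app.jar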
Keeps track of task ids and executor ids, executors per host, hosts per rack
You can give one or many comma-separated master URLs in the spark:// URL.
Caution FIXME
FIXME
SPARK_WORKER_INSTANCES (and
SPARK_WORKER_CORES)
There is really no need to run multiple workers per machine in Spark 1.5 (perhaps in 1.4,
too). You can run multiple executors on the same machine with one worker.
You can set up the number of cores as a command-line argument when you start a worker daemon using --cores .
Since the change SPARK-1706 Allow multiple executors per worker in Standalone mode in
Spark 1.4 it’s currently possible to start multiple executors in a single JVM process of a
worker.
To launch multiple executors on a machine you start multiple standalone workers, each with
its own JVM. It introduces unnecessary overhead due to these JVM processes, provided
that there are enough cores on that worker.
If you are running Spark in standalone mode on memory-rich nodes it can be beneficial to have multiple worker instances on the same node, as a very large heap size has two disadvantages:
Garbage collection pauses can hurt the throughput of Spark jobs.
Heap sizes over 32 GB cannot use CompressedOops, so 32 GB or less is recommended.
Mesos and YARN can, out of the box, support packing multiple, smaller executors onto the
same physical host, so requesting smaller executors doesn’t mean your application will have
fewer overall resources.
SparkDeploySchedulerBackend
SparkDeploySchedulerBackend is the Scheduler Backend for Spark Standalone, i.e. it is used when you use a spark:// master URL.
AppClient
AppClient is an interface to allow Spark applications to talk to a Standalone cluster (using a RPC Environment).
AppClient registers AppClient RPC endpoint (using ClientEndpoint class) to a given RPC
Environment.
AppClient uses a daemon cached thread pool ( askAndReplyThreadPool ) for asking and replying to master.
Others
killExecutors
start
stop
It is a ThreadSafeRpcEndpoint that knows about the RPC endpoint of the primary active
standalone Master (there can be a couple of them, but only one can be active and hence
primary).
An AppClient registers the Spark application to a single master (regardless of the number of
the standalone masters given in the master URL).
An AppClient tries connecting to a standalone master 3 times every 20 seconds per master
before giving up. They are not configurable parameters.
RegisteredApplication is a one-way message from the primary master to confirm successful application registration. It comes with the application id and the master's RPC endpoint reference.
The AppClientListener gets notified about the event via listener.connected(appId) with
appId being an application id.
ApplicationRemoved is received from the primary master to inform about having removed the application.
It can come from the standalone Master after a kill request from Web UI, application has
finished properly or the executor where the application was still running on has been killed,
failed, lost or exited.
stop the AppClient after the SparkContext has been stopped (and so should the running
application on the standalone cluster).
Master
Standalone Master (often written standalone Master) is the cluster manager for Spark
Standalone cluster. It can be started and stopped using custom management scripts for
standalone Master.
A standalone Master is pretty much the Master RPC Endpoint that you can access using
RPC port (low-level operation communication) or Web UI.
workers ( workers )
applications ( apps )
endpointToApp
addressToApp
completedApps
nextAppNumber
drivers ( drivers )
completedDrivers
nextDriverNumber
The following INFO shows up when the Master endpoint starts up ( Master#onStart is
called):
Master WebUI
FIXME MasterWebUI
MasterWebUI is the Web UI server for the standalone master. Master starts Web UI to listen
States
Master can be in the following states:
RECOVERING
COMPLETING_RECOVERY
Caution FIXME
RPC Environment
The org.apache.spark.deploy.master.Master class starts sparkMaster RPC environment.
The Master endpoint starts the daemon single-thread scheduler pool master-forward-
message-thread . It is used for worker management, i.e. removing any timed-out workers.
Metrics
Master uses Spark Metrics System (via MasterSource ) to report metrics about internal
status.
Caution FIXME Review org.apache.spark.metrics.MetricsConfig . How to access the metrics for master? See Master#onStart . Review masterMetricsSystem and applicationMetricsSystem .
REST Server
The standalone Master starts the REST Server service for alternative application submission
that is supposed to work across Spark versions. It is enabled by default (see
spark.master.rest.enabled) and used by spark-submit for the standalone cluster mode, i.e. --deploy-mode is cluster .
The following INFOs show up when the Master Endpoint starts up ( Master#onStart is
called) with REST Server enabled:
Recovery Mode
A standalone Master can run with recovery mode enabled and be able to recover state
among the available swarm of masters. By default, there is no recovery, i.e. no persistence
and no election.
Note Only a master can schedule tasks so having one always on is important for cases where you want to launch new tasks. Running tasks are unaffected by the state of the master.
The Recovery Mode enables election of the leader master among the masters.
Tip Check out the exercise Spark Standalone - Using ZooKeeper for High-Availability of Master.
Leader Election
Master endpoint is LeaderElectable , i.e. FIXME
Caution FIXME
RPC Messages
Master communicates with drivers, executors and configures itself using RPC messages.
CompleteRecovery
RevokedLeadership
RegisterApplication
ExecutorStateChanged
DriverStateChanged
Heartbeat
MasterChangeAcknowledged
WorkerSchedulerStateResponse
UnregisterApplication
CheckForWorkerTimeOut
RegisterWorker
RequestSubmitDriver
RequestKillDriver
RequestDriverStatus
RequestMasterState
BoundPortsRequest
RequestExecutors
KillExecutors
RegisterApplication event
A RegisterApplication event is sent by AppClient to the standalone Master. The event
holds information about the application being deployed ( ApplicationDescription ) and the
driver’s endpoint reference.
ApplicationDescription describes an application by its name, maximum number of cores, executor's memory, command, appUiUrl, and user, with optional eventLogDir and eventLogCodec for Event Logs, and the number of cores per executor.
A driver has a state, i.e. driver.state and when it’s in DriverState.RUNNING state the driver
has been assigned to a worker for execution.
The message holds information about the id and name of the driver.
A driver can be running on a single worker while a worker can have many drivers running.
When a worker receives a LaunchDriver message, it prints out the following INFO:
It then creates a DriverRunner and starts it. It starts a separate JVM process.
Workers' free memory and cores are considered when assigning some to waiting drivers
(applications).
DriverRunner
Warning It seems a dead piece of code. Disregard it for now.
It is a java.lang.Process
Internals of org.apache.spark.deploy.master.Master
The above command suspends ( suspend=y ) the process until a JPDA debugging client, e.g. you
When Master starts, it first creates the default SparkConf configuration whose values it
then overrides using environment variables and command-line options.
It starts RPC Environment with necessary endpoints and lives until the RPC environment
terminates.
Worker Management
When a worker hasn’t responded for spark.worker.timeout , it is assumed dead and the
following WARN message appears in the logs:
SPARK_PUBLIC_DNS (default: hostname) - the custom master hostname for WebUI’s http
spark-defaults.conf from which all properties that start with spark. prefix are loaded.
Settings
Caution FIXME Where are the `RETAINED_` properties used?
totalExpectedCores ). When set, an application could get executors of different sizes (in
terms of cores).
spark.dead.worker.persistence (default: 15 )
StandaloneRecoveryModeFactory .
spark.deploy.spreadOut (default: true ) - whether to perform round-robin scheduling across the nodes (spreading out each app among all the nodes). Refer to Round-robin Scheduling Across Nodes.
Executor Summary
Executor Summary page displays information about the executors for the application id
given as the appId request parameter.
When an executor is added to the pool of available executors, it enters LAUNCHING state. It
can then enter either RUNNING or FAILED states.
ExecutorRunner.killProcess
If no application for the appId could be found, Not Found page is displayed.
sbin/start-master.sh
sbin/start-master.sh script starts a Spark master on the machine the script is executed on.
./sbin/start-master.sh
org.apache.spark.deploy.master.Master \
--ip japila.local --port 7077 --webui-port 8080
It has support for starting Tachyon using the --with-tachyon command-line option. It assumes the tachyon/bin/tachyon command to be available in Spark's home directory.
sbin/spark-config.sh
bin/load-spark-env.sh
Command-line Options
You can use the following command-line options: --host (or -h ), --ip (or -i , deprecated), --port (or -p ), --webui-port , --properties-file and --help . A command-line option overrides the corresponding environment variable, if set.
sbin/stop-master.sh
You can stop a Spark Standalone master using sbin/stop-master.sh script.
./sbin/stop-master.sh
./sbin/start-slave.sh [masterURL]
The order of importance of Spark configuration settings is as follows (from least to the most
important):
Command-line options
Spark properties
slave.
SPARK_WORKER_PORT - the base port number to listen on for the first worker. If set,
subsequent workers will increment this number. If unset, Spark will pick a random port.
SPARK_WORKER_WEBUI_PORT (default: 8081 ) - the base port for the web UI of the first
worker. Subsequent workers will increment this number. If the port is used, the
successive ports are tried until a free one is found.
sbin/spark-config.sh
bin/load-spark-env.sh
Command-line Options
You can use the following command-line options:
--host (or -h ) - the hostname to listen on; overrides the SPARK_WORKER_HOST environment variable.
--port (or -p ) - the port to listen on; overrides the SPARK_WORKER_PORT environment variable.
--webui-port - the port of the worker's web UI; overrides the SPARK_WORKER_WEBUI_PORT environment variable.
--cores (or -c ) - the number of cores to use; overrides the SPARK_WORKER_CORES environment variable.
--memory (or -m ) - the amount of memory to use; overrides the SPARK_WORKER_MEMORY environment variable.
--work-dir (or -d ) - the directory to run applications in; overrides the SPARK_WORKER_DIR environment variable.
--properties-file - the path to a custom Spark properties file.
--help - prints out the usage help.
Spark properties
After loading the default SparkConf, if --properties-file or SPARK_WORKER_OPTS define
spark.worker.ui.port , the value of the property is used as the port of the worker’s web UI.
or
$ cat worker.properties
spark.worker.ui.port=33333
sbin/spark-daemon.sh
Ultimately, the script calls sbin/spark-daemon.sh start to kick off
org.apache.spark.deploy.worker.Worker with --webui-port , --port and the master URL.
Internals of org.apache.spark.deploy.worker.Worker
Upon starting, a Spark worker creates the default SparkConf.
It starts sparkWorker RPC Environment and waits until the RpcEnv terminates.
RPC environment
The org.apache.spark.deploy.worker.Worker class starts its own sparkWorker RPC
environment with Worker endpoint.
It has support for starting Tachyon using the --with-tachyon command-line option. It assumes the tachyon/bin/tachyon command to be available in Spark's home directory.
sbin/spark-config.sh
bin/load-spark-env.sh
conf/spark-env.sh
The script uses the following environment variables (and sets them when unavailable):
SPARK_PREFIX
SPARK_HOME
SPARK_CONF_DIR
SPARK_MASTER_PORT
SPARK_MASTER_IP
The following command will launch 3 worker instances on each node. Each worker instance
will use two cores.
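For example, the effect can be achieved with the following settings in conf/spark-env.sh before starting a worker (a sketch; the master URL is a placeholder):

# conf/spark-env.sh
SPARK_WORKER_INSTANCES=3
SPARK_WORKER_CORES=2

$ ./sbin/start-slave.sh spark://localhost:7077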
If you however want to filter out the JVM processes that really belong to Spark you should
pipe the command’s output to OS-specific tools like grep .
$ jps -lm
999 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8080
397
669 org.jetbrains.idea.maven.server.RemoteMavenServer
1198 sun.tools.jps.Jps -lm
spark-daemon.sh status
You can also check out ./sbin/spark-daemon.sh status .
When you start Spark Standalone using scripts under sbin , PIDs are stored in /tmp
directory by default. ./sbin/spark-daemon.sh status can read them and do the "boilerplate"
for you, i.e. status a PID.
$ ls /tmp/spark-*.pid
/tmp/spark-jacek-org.apache.spark.deploy.master.Master-1.pid
You can use the Spark Standalone cluster in the following ways:
Use spark-shell with --master MASTER_URL
Use SparkConf.setMaster(MASTER_URL) in your Spark application
Important For our learning purposes, MASTER_URL is spark://localhost:7077 .
./sbin/start-master.sh
Notes:
Use spark.deploy.spreadOut (default: true ) to allow users to set a flag that will
perform round-robin scheduling across the nodes (spreading out each app among
all the nodes) instead of trying to consolidate each app onto a small # of nodes.
./sbin/start-slave.sh spark://japila.local:7077
4. Check out master’s web UI at http://localhost:8080 to know the current setup - one
worker.
5. Let’s stop the worker to start over with custom configuration. You use ./sbin/stop-
slave.sh to stop the worker.
./sbin/stop-slave.sh
6. Check out master’s web UI at http://localhost:8080 to know the current setup - one
worker in DEAD state.
8. Check out master’s web UI at http://localhost:8080 to know the current setup - one
worker ALIVE and another DEAD.
Figure 4. Master’s web UI with one worker ALIVE and one DEAD
9. Configuring cluster using conf/spark-env.sh
conf/spark-env.sh
SPARK_WORKER_CORES=2 (1)
SPARK_WORKER_INSTANCES=2 (2)
SPARK_WORKER_MEMORY=2g
./sbin/start-slave.sh spark://japila.local:7077
$ ./sbin/start-slave.sh spark://japila.local:7077
starting org.apache.spark.deploy.worker.Worker, logging to ../logs/spark-
starting org.apache.spark.deploy.worker.Worker, logging to ../logs/spark-
11. Check out master’s web UI at http://localhost:8080 to know the current setup - at least
two workers should be ALIVE.
Note
$ jps
6580 Worker
4872 Master
6874 Jps
6539 Worker
./sbin/stop-all.sh
Spark on Mesos
Running Spark on Mesos
A Mesos cluster needs at least one Mesos Master to coordinate and dispatch tasks onto
Mesos Slaves.
$ mesos-slave --master=127.0.0.1:5050
I0401 00:15:05.850455 1916461824 main.cpp:223] Build: 2016-03-17 14:20:58 by brew
I0401 00:15:05.850772 1916461824 main.cpp:225] Version: 0.28.0
I0401 00:15:05.852812 1916461824 containerizer.cpp:149] Using isolation: posix/cpu,posix/mem,filesyst
I0401 00:15:05.866186 1916461824 main.cpp:328] Starting Mesos slave
I0401 00:15:05.869470 218980352 slave.cpp:193] Slave started on 1)@10.1.47.199:5051
...
I0401 00:15:05.906355 218980352 slave.cpp:832] Detecting new master
I0401 00:15:06.762917 220590080 slave.cpp:971] Registered with master master@127.0.0.1:5050; given sl
...
Figure 2. Mesos Management Console (Slaves tab) with one slave running
Important You have to export the MESOS_NATIVE_JAVA_LIBRARY environment variable before connecting to the Mesos cluster:
$ export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib
The preferred approach to launch Spark on Mesos and to give the location of Spark
binaries is through spark.executor.uri setting.
--conf spark.executor.uri=/Users/jacek/Downloads/spark-1.5.2-bin-hadoop2.6.tgz
Note For us, on the bleeding edge of Spark development, it is very convenient to use the spark.mesos.executor.home setting instead.
-c spark.mesos.executor.home=`pwd`
In Frameworks tab you should see a single active framework for spark-shell .
Figure 3. Mesos Management Console (Frameworks tab) with Spark shell active
Tip Consult slave logs under /tmp/mesos/slaves when facing troubles.
CoarseMesosSchedulerBackend
CoarseMesosSchedulerBackend is the scheduler backend for Spark on Mesos.
It requires a Task Scheduler, Spark context, mesos:// master URL, and Security Manager.
It accepts only two failures before blacklisting a Mesos slave (it is hardcoded and not
configurable).
It tracks:
Settings
spark.default.parallelism (default: 8 ) - the number of cores to use as Default Level
of Parallelism.
FIXME
MesosExternalShuffleClient
FIXME
(Fine)MesosSchedulerBackend
When spark.mesos.coarse is false , Spark on Mesos uses MesosSchedulerBackend .
reviveOffers
It calls mesosDriver.reviveOffers() .
Caution FIXME
Settings
spark.mesos.coarse (default: true ) controls whether the scheduler backend for Mesos works in coarse-grained or fine-grained mode.
Caution FIXME Review MesosClusterScheduler.scala
MesosExternalShuffleService
Introduction to Mesos
Mesos is a distributed system kernel
Concepts in Mesos:
Frameworks
Mesos master
Mesos API
Mesos agent
Mesos typically runs with an agent on every virtual machine or bare metal server under
management (https://www.joyent.com/blog/mesos-by-the-pound)
Mesos uses ZooKeeper for master election and discovery. Apache Aurora is a scheduler that runs on Mesos.
Mesos is written in C++, not Java, and includes support for Docker along with other
frameworks. Mesos, then, is the core of the Mesos Data Center Operating System, or
DCOS, as it was coined by Mesosphere.
This Operating System includes other handy components such as Marathon and
Chronos. Marathon provides cluster-wide “init” capabilities for application in containers
like Docker or cgroups. This allows one to programmatically automate the launching of
large cluster-based applications. Chronos acts as a Mesos API for longer-running batch
type jobs while the core Mesos SDK provides an entry point for other applications like
Hadoop and Spark.
The true goal is a full shared, generic and reusable on demand distributed architecture.
Out of the box it will include Cassandra, Kafka, Spark, and Akka.
to allow Mesos to centrally schedule YARN work via a Mesos based framework,
including a REST API for scaling up or down
Schedulers in Mesos
Available scheduler modes:
coarse-grained mode
fine-grained mode
The main difference between these two scheduler modes is the number of tasks per Spark
executor per single Mesos executor. In fine-grained mode, there is a single task in a single
Spark executor that shares a single Mesos executor with the other Spark executors. In
coarse-grained mode, there is a single Spark executor per Mesos executor with many Spark
tasks.
Coarse-grained mode pre-starts all the executor backends, e.g. Executor Backends, so it has the least overhead compared to fine-grained mode. Since the executors are up before tasks get launched, it is better for interactive sessions. It also means that the resources are locked up in a task.
Spark on Mesos supports dynamic allocation in the Mesos coarse-grained scheduler since
Spark 1.5. It can add/remove executors based on load, i.e. kills idle executors and adds
executors when tasks queue up. It needs an external shuffle service on each node.
Mesos Fine-Grained Mode offers a better resource utilization. It has a slower startup for
tasks and hence it is fine for batch and relatively static streaming.
Commands
The following command is how you could execute a Spark application on Mesos:
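A sketch only (the master address, paths and class name are placeholders):

$ ./bin/spark-submit \
  --master mesos://127.0.0.1:5050 \
  --conf spark.executor.uri=/path/to/spark-1.5.2-bin-hadoop2.6.tgz \
  --class pl.japila.spark.MyApp \
  /path/to/my-app.jar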
Other Findings
From Four reasons to pay attention to Apache Mesos:
to run Spark workloads well you need a resource manager that not only can handle the rapid swings in load inherent in analytics processing, but one that can do so smartly. Matching of the task to the RIGHT resources is crucial and awareness of the physical
Matching of the task to the RIGHT resources is crucial and awareness of the physical
environment is a must. Mesos is designed to manage this problem on behalf of
workloads like Spark.
Spark on YARN
Spark on YARN supports multiple application attempts.
Introduction to YARN
Introduction to YARN by Adam Kawa is an excellent introduction to YARN. Here are the
most important facts to get you going.
Hadoop 2.0 comes with Yet Another Resource Negotiator (YARN) that is a generic
cluster resource management framework that can run applications on a Hadoop cluster.
(see Apache Twill)
NameNode
Hadoop for storing and processing large amount of data on a cluster of commodity
hardware.
The Pre-YARN MapReduce engine - MRv1 - was rewritten for YARN. It became yet
another YARN distributed application called MRv2.
There’s another article that covers the fundamentals of YARN - Untangling Apache Hadoop
YARN, Part 1. Notes follow:
A host is the Hadoop term for a computer (also called a node, in YARN terminology).
It can technically also be a single host used for debugging and simple testing.
Master hosts are a small number of hosts reserved to control the rest of the cluster.
Worker hosts are the non-master hosts in the cluster.
A master host is the communication point for a client program. A master host sends
the work to the rest of the cluster, which consists of worker hosts.
The ResourceManager is the master daemon that communicates with the client,
tracks resources on the cluster, and orchestrates work by assigning tasks to
NodeManagers.
In a Hadoop cluster with YARN running, the master process is called the
ResourceManager and the worker processes are called NodeManagers.
The NodeManager on each host keeps track of the local host’s resources, and the
ResourceManager keeps track of the cluster’s total.
The YARN configuration file is an XML file that contains properties. This file is placed in
a well-known location on each host in the cluster and is used to configure the
ResourceManager and NodeManager. By default, this file is named yarn-site.xml .
Each NodeManager tracks its own local resources and communicates its resource
configuration to the ResourceManager, which keeps a running total of the cluster’s
available resources.
Once a hold has been granted on a host, the NodeManager launches a process called
a task.
For each running application, a special piece of code called an ApplicationMaster helps
coordinate tasks on the YARN cluster. The ApplicationMaster is the first process run
after the application starts.
One or more tasks that do the actual work (runs in a process) in the container
allocated by YARN.
The application starts and talks to the ResourceManager (running on the master)
for the cluster.
Once all tasks are finished, the ApplicationMaster exits. The last container is de-
allocated from the cluster.
Hadoop YARN
From Apache Hadoop's web site:
Hadoop YARN: A framework for job scheduling and cluster resource management.
YARN is essentially a container system and scheduler designed primarily for use with a
Hadoop-based cluster.
Spark runs on YARN clusters, and can read from and save data to HDFS.
Spark needs a distributed file system, and HDFS (or Amazon S3, though slower) is a great
choice.
Excellent throughput when Spark and Hadoop are both distributed and co-located on
the same (YARN or Mesos) cluster nodes.
When reading data from HDFS, each InputSplit maps to exactly one Spark partition.
HDFS distributes files across data nodes; when a file is stored on the filesystem, it is
split into partitions.
How it works
The Spark driver in Spark on YARN launches a number of executors. Each executor
processes a partition of HDFS-based data.
YarnAllocator
YarnAllocator requests containers from the YARN ResourceManager and decides what to
do with containers when YARN fulfills these requests. It uses YARN’s AMRMClient APIs.
ExecutorAllocationClient
ExecutorAllocationClient is a client class that communicates with the cluster manager to
request or kill executors.
Misc
SPARK_YARN_MODE property and environment variable
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
YARN integration has some advantages, like dynamic allocation. If you enable dynamic
allocation, after the stage including InputSplits gets submitted, Spark will try to request
an appropriate number of executors.
The memory in the YARN resource requests is --executor-memory + what’s set for
spark.yarn.executor.memoryOverhead , which defaults to 10% of --executor-memory .
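As a rough worked example (assuming the 10% default and ignoring any minimum overhead Spark may enforce): with --executor-memory 4g , the YARN container request would be about 4g + 0.4g = 4.4g per executor.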
If YARN has enough resources, it will deploy the executors distributed across the cluster,
and then each of them will try to process the data locally ( NODE_LOCAL in Spark Web UI),
with as many splits in parallel as you defined in spark.executor.cores .
"YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI
to ensure that workers are registered and have sufficient resources"
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors N
spark.dynamicAllocation.initialExecutors 0
spark.dynamicAllocation.minExecutors requires
spark.dynamicAllocation.initialExecutors
Cluster Mode
Spark on YARN supports submitting Spark applications in cluster deploy mode.
YarnClusterSchedulerBackend
YarnClusterSchedulerBackend is a scheduler backend for Spark on YARN in cluster deploy
mode.
This is the only scheduler backend that supports multiple application attempts and URLs for
driver’s logs to display as links in the web UI in the Executors tab for the driver.
It uses spark.yarn.app.attemptId under the covers (that the YARN resource manager
sets?).
Caution FIXME
YarnScheduler
It appears that this is a custom implementation to keep track of racks per host that is used in
TaskSetManager.resourceOffer to find a task with RACK_LOCAL locality preferences.
Execution Model
Caution FIXME This is the single place for explaining jobs, stages, tasks. Move relevant parts from the other places.
Broadcast Variables
From the official documentation about Broadcast Variables:
Broadcast variables allow the programmer to keep a read-only variable cached on each
machine rather than shipping a copy of it with tasks.
Explicitly creating broadcast variables is only useful when tasks across multiple stages
need the same data or when caching the data in deserialized form is important.
The Broadcast feature in Spark uses SparkContext to create broadcast values and
BroadcastManager and ContextCleaner to manage their lifecycle.
Introductory Example
Let’s start with an introductory example to check out how to use broadcast variables and
build your initial understanding.
You’re going to use a static mapping of interesting projects with their websites, i.e.
Map[String, String] that the tasks, i.e. closures (anonymous functions) in transformations,
use.
scala> val pws = Map("Apache Spark" -> "http://spark.apache.org/", "Scala" -> "http://www.scala-lang.org/")
pws: scala.collection.immutable.Map[String,String] = Map(Apache Spark -> http://spark.apache.org/, Scala -> http://www.scala-lang.org/)
It works, but is very inefficient as the pws map is sent over the wire to executors while it
could have been there already. If there were more tasks that need the pws map, you could
improve their performance by minimizing the number of bytes that are going to be sent over
the network for task execution.
Semantically, the two computations - with and without the broadcast value - are exactly the
same, but the broadcast-based one wins performance-wise when there are more executors
spawned to execute many tasks that use the pws map.
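Here is a minimal sketch of the broadcast-based variant, assuming a spark-shell session where sc and the pws map above are available:
val pwsB = sc.broadcast(pws)
// the broadcast value is shipped to each executor only once and reused by all its tasks
val websites = sc.parallelize(Seq("Apache Spark", "Scala")).map(pwsB.value).collect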
Introduction
Broadcast is part of Spark that is responsible for broadcasting information across nodes in
a cluster.
You use broadcast variables to implement a map-side join, i.e. a join using a map . For this,
lookup tables are distributed across nodes in a cluster using broadcast and then looked up
inside map (to do the join implicitly).
When you broadcast a value, it is copied to executors only once (while it is copied multiple
times for tasks otherwise). It means that broadcast can help make your Spark application
faster if you have a large value to use in tasks or there are more tasks than executors.
A Spark idiom appears to be emerging that uses broadcast with collectAsMap to create a
Map for broadcast: map an RDD down to a smaller dataset (column-wise, not record-wise),
collectAsMap it on the driver, and broadcast the resulting Map. Looking its elements up from
the very big RDD is then computationally faster than joining the RDDs directly.
Use large broadcast HashMaps over RDDs whenever possible and leave RDDs with a
key to look up the necessary data as demonstrated above.
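A minimal sketch of this idiom, assuming a spark-shell session; the datasets and columns are made up for illustration:
// a small lookup dataset collected as a Map on the driver and broadcast
val countryNames = sc.parallelize(Seq((1, "Poland"), (2, "Germany"))).collectAsMap
val countryNamesB = sc.broadcast(countryNames)
// the large dataset keeps only a key to look the broadcast Map up (the map-side join)
val people = sc.parallelize(Seq(("Jacek", 1), ("Agata", 1), ("Hans", 2)))
val joined = people.map { case (name, countryId) => (name, countryNamesB.value(countryId)) }
joined.collect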
SparkContext.broadcast
Read about SparkContext.broadcast method in Creating broadcast variables.
Further Reading
Map-Side Join in Spark
Accumulators in Spark
Tip Read the latest documentation about accumulators before looking for anything useful here.
Noticed on the user@spark mailing list that using an external key-value store (like HBase,
Redis, Cassandra) and performing lookups/updates inside of your mappers (creating a
connection within a mapPartitions code block to avoid the connection setup/teardown
overhead) might be a better solution.
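A minimal accumulator sketch, assuming a spark-shell session (using the Spark 1.x sc.accumulator API that these notes cover):
val errorCount = sc.accumulator(0, "error count")
sc.parallelize(Seq("ok", "error", "ok", "error")).foreach { line =>
  if (line == "error") errorCount += 1
}
// read the accumulated value on the driver only
errorCount.value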
Spark Security
Enable security via spark.authenticate property (defaults to false ).
See org.apache.spark.SecurityManager
Securing Web UI
Tip Read the official document Web UI.
To secure Web UI you implement a security filter and use spark.ui.filters setting to refer
to the class.
neolitec/BasicAuthenticationFilter.java
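A minimal sketch of pointing Spark at such a filter, for your application code (the fully-qualified class name below is hypothetical; use the servlet filter class you actually implement):
// the filter class must be on the classpath of the driver
val conf = new org.apache.spark.SparkConf()
  .set("spark.ui.filters", "com.example.BasicAuthenticationFilter")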
Using Input and Output (I/O)
Spark offers different APIs to read data based upon the content and the storage.
binary
text
files
Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data
to HDFS. You’ll know what I mean the first time you try to save "all-the-data.csv" and
are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS
file and then several part-0000n files for each partition that took part in the job.
Methods:
Returns HadoopRDD
When using textFile to read an HDFS folder with multiple files inside, the number
of partitions is equal to the number of HDFS blocks.
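A quick way to check this in spark-shell (the HDFS path below is hypothetical):
val lines = sc.textFile("hdfs://namenode:8020/data/input")
// one partition per HDFS block / InputSplit
lines.partitions.size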
URLs supported:
s3://… or s3n://…
hdfs://…
file://…
The general rule seems to be to use HDFS to read files multiple times, with S3 as storage
for one-time access.
saveAsTextFile
saveAsObjectFile
saveAsSequenceFile
saveAsHadoopFile
Since an RDD is actually a set of partitions that make it up, saving an RDD to a file saves
the content of each partition to a separate file (one per partition).
1. parallelize uses 4 to denote the number of partitions so there are going to be 4 files
saved.
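A minimal sketch of the kind of example referenced above, assuming a spark-shell session (the output path is made up):
val ints = sc.parallelize(1 to 100, 4)  // 4 partitions
ints.saveAsTextFile("ints.out")         // creates ints.out/ with _SUCCESS and part-0000[0-3], one file per partition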
S3
s3://… or s3n://… URLs are supported.
configuration).
scala> f.foreach(println)
...
15/09/13 19:06:52 INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/f.txt.gz:0+38
15/09/13 19:06:52 INFO CodecPool: Got brand-new decompressor [.gz]
Ala ma kota
if the directory contains multiple SequenceFiles all of them will be added to RDD
SequenceFile RDD
cp conf/log4j.properties.template conf/log4j.properties
Edit conf/log4j.properties so the line log4j.rootCategory uses appropriate log level, e.g.
log4j.rootCategory=ERROR, console
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
FIXME
Describe the other computing models using Spark SQL, MLlib, Spark Streaming, and
GraphX.
$ ./bin/spark-shell
...
Spark context available as sc.
...
SQL context available as sqlContext.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.5.0-SNAPSHOT
/_/
Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc.addFile("/Users/jacek/dev/sandbox/hello.json")
scala> SparkFiles.get("/Users/jacek/dev/sandbox/hello.json")
See org.apache.spark.SparkFiles.
scala> sc.textFile("http://japila.pl").foreach(println)
java.io.IOException: No FileSystem for scheme: http
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
...
Serialization
Serialization systems:
Java serialization
Kryo
Avro
Thrift
Protobuf
Spark Application Frameworks
Spark Application Frameworks further abstract the concept of RDD and leverage the resiliency and distribution of the
computing platform.
There are the following built-in libraries that all constitute the Spark platform:
Spark SQL
Spark GraphX
Spark Streaming
Spark MLlib
Spark Streaming
Spark Streaming is a stream processing extension of Spark.
Spark Streaming offers the data abstraction called DStream that hides the complexity of
dealing with a continuous data stream and makes it as easy for programmers as using one
single RDD at a time.
Note I think Spark Streaming shines at performing the T stage well, i.e. the transformation stage, while leaving the E and L stages to more specialized tools like Apache Kafka or frameworks like Akka.
For a software developer, a DStream is similar to work with as an RDD, with the DStream API
matching the RDD API. Interestingly, you can reuse your RDD-based code and apply it to a
DStream - a stream of RDDs - with no changes at all (through foreachRDD).
It runs streaming jobs every batch duration to pull and process data (often called records)
from one or many input streams.
Each batch computes (generates) a RDD for data in input streams for a given batch and
submits a Spark job to compute the result. It does this over and over again until the
streaming context is stopped (and the owning streaming application terminated).
To avoid losing records in case of failure, Spark Streaming supports checkpointing that
writes received records to a highly-available HDFS-compatible storage and allows recovering
from temporary downtimes.
Spark Streaming allows for integration with real-time data sources ranging from such basic
ones like a HDFS-compatible file system or socket connection to more advanced ones like
Apache Kafka or Apache Flume.
About Spark Streaming from the official documentation (that pretty much nails what it offers):
Spark Streaming is an extension of the core Spark API that enables scalable, high-
throughput, fault-tolerant stream processing of live data streams. Data can be ingested
from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and
can be processed using complex algorithms expressed with high-level functions like
map, reduce, join and window. Finally, processed data can be pushed out to
filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine
learning and graph processing algorithms on data streams.
StreamingContext
Stream Operators
Streaming Job
Receivers
Micro Batch
Micro Batch is a collection of input records as collected by Spark Streaming that is later
represented as an RDD.
Streaming Job
A streaming Job represents a Spark computation with one or many Spark jobs.
It is identified (in the logs) as streaming job [time].[outputOpId] with outputOpId being the
position in the sequence of jobs in a JobSet.
Internal Registries
nextInputStreamId - the current InputStream id
StreamingSource
Caution FIXME
StreamingContext
StreamingContext is the main entry point for all Spark Streaming functionality. Whatever you
do in Spark Streaming has to start from creating an instance of StreamingContext.
Once streaming pipelines are developed, you start StreamingContext to set the stream
transformations in motion. You stop the instance when you are done.
Creating Instance
You can create a new instance of StreamingContext using the following constructors. You
can group them by whether a StreamingContext constructor creates it from scratch or it is
recreated from checkpoint directory (follow the links for their extensive coverage).
StreamingContext(path: String)
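A minimal sketch of creating a StreamingContext from scratch in spark-shell, reusing the shell's sc (the batch interval is arbitrary):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(5))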
StreamingContext will warn you when you use local or local[1] master URLs:
A JobScheduler is created.
A StreamingJobProgressListener is created.
A StreamingSource is instantiated.
Creating ReceiverInputDStreams
StreamingContext offers the following methods to create ReceiverInputDStreams:
receiverStream(receiver: Receiver[T])
You can also use two additional methods in StreamingContext to build (or rather
compose) a custom DStream:
receiverStream method
You can register a custom input dstream using receiverStream method. It accepts a
Receiver.
transform method
transform Example
import org.apache.spark.rdd.RDD
def union(rdds: Seq[RDD[_]], time: Time) = {
rdds.head.context.union(rdds.map(_.asInstanceOf[RDD[Int]]))
}
ssc.transform(Seq(cis), union)
remember method
remember method sets the remember interval (for the graph of output dstreams). It simply passes the interval on to DStreamGraph.
Checkpoint Interval
The checkpoint interval is an internal property of StreamingContext and corresponds to
batch interval or checkpoint interval of the checkpoint (when checkpoint was present).
Note The checkpoint interval property is also called graph checkpointing interval.
checkpoint interval is mandatory when checkpoint directory is defined (i.e. not null ).
Checkpoint Directory
A checkpoint directory is a HDFS-compatible directory where checkpoints are written to.
Its initial value depends on whether the StreamingContext was (re)created from a checkpoint
or not, and is the checkpoint directory if so. Otherwise, it is not set (i.e. null ).
You can set the checkpoint directory when a StreamingContext is created or later using
checkpoint method.
Initial Checkpoint
Initial checkpoint is the checkpoint (file) this StreamingContext has been recreated from.
isCheckpointPresent internal method behaves like a flag that remembers whether the
StreamingContext instance was created from a checkpoint or not so the other internal parts
of a streaming application can make decisions how to initialize themselves (or just be
initialized).
isCheckpointPresent checks the existence of the initial checkpoint that gave birth to the
StreamingContext.
You use checkpoint method to set directory as the current checkpoint directory.
method.
start(): Unit
You start stream processing by calling start() method. It acts differently per state of
StreamingContext and only INITIALIZED state makes for a proper startup.
If no other StreamingContext exists, it performs setup validation and starts JobScheduler (in
a separate dedicated daemon thread called streaming-start).
It then registers the shutdown hook stopOnShutdown and registers streaming metrics source.
If web UI is enabled (by spark.ui.enabled ), it attaches the Streaming tab.
Given all the above has finished properly, it is assumed that the StreamingContext
started fine and so you should see the following INFO message in the logs:
stop methods stop the execution of the streams immediately ( stopGracefully is false )
or wait for the processing of all received data to be completed ( stopGracefully is true ).
stop reacts appropriately per the state of StreamingContext , but the end state is always the STOPPED state.
1. JobScheduler is stopped.
4. ContextWaiter is notifyStop()
5. shutdownHookRef is cleared.
At that point, you should see the following INFO message in the logs:
shuts down, e.g. all non-daemon threads exited, System.exit was called or ^C was typed.
Setup Validation
validate(): Unit
It first asserts that DStreamGraph has been assigned (i.e. graph field is not null ) and
triggers validation of DStreamGraph.
If checkpointing is enabled, it ensures that checkpoint interval is set and checks whether the
current streaming runtime environment can be safely serialized by serializing a checkpoint
for fictitious batch time 0 (not zero time).
If dynamic allocation is enabled, it prints the following WARN message to the logs:
Caution FIXME
States
StreamingContext can be in three states: INITIALIZED, ACTIVE, and STOPPED.
Stream Operators
You use stream operators to apply transformations to the elements received (often called
records) from input streams and ultimately trigger computations using output operators.
Transformations are stateless, but Spark Streaming comes with an experimental support for
stateful operators (e.g. mapWithState or updateStateByKey). It also offers windowed
operators that can work across batches.
Note You may use RDDs from other (non-streaming) data sources to build more advanced pipelines.
output operators that register input streams as output streams so the execution can
start.
(output operator) print to print 10 elements only or the more general version
print(num: Int) to print up to num elements. See print operation in this document.
slice
window
reduceByWindow
reduce
map
glom
transform
transformWith
flatMap
filter
repartition
mapPartitions
count
countByValue
countByWindow
countByValueAndWindow
union
Most streaming operators come with their own custom DStream to offer the service. It,
however, very often boils down to overriding the compute method and applying the
corresponding RDD operator on a generated RDD.
print Operator
print(num: Int) operator prints num first elements of each RDD in the input stream.
For each batch, print operator prints the following header to the standard output
(regardless of the number of elements to be printed out):
-------------------------------------------
Time: [time] ms
-------------------------------------------
Internally, it calls RDD.take(num + 1) (see take action) on each RDD in the stream to print
num elements. It then prints … if there are more elements in the RDD than the requested num .
foreachRDD Operators
foreachRDD Example
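A minimal sketch, assuming a DStream[String] called clicks built elsewhere in this spark-shell session:
clicks.foreachRDD { rdd =>
  // regular RDD code runs here for every batch
  println(s"batch size: ${rdd.count}")
}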
glom Operator
glom(): DStream[Array[T]]
glom operator creates a new stream in which RDDs in the source stream are passed through RDD.glom ,
i.e. it coalesces all elements in RDDs within each partition into an array.
reduce Operator
reduce operator creates a new stream of RDDs of a single element that is the result of applying the reduce function to the elements in the RDD for a batch.
reduce Example
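A minimal sketch, assuming a DStream[Int] called clicks:
val sum = clicks.reduce(_ + _)
sum.print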
map Operator
map operator creates a new stream with the source elements being mapped over using
mapFunc function.
It creates MappedDStream stream that, when requested to compute a RDD, uses RDD.map
operator.
map Example
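A minimal sketch, assuming a DStream[Int] called clicks:
val doubled = clicks.map(_ * 2)
doubled.print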
reduceByKey Operator
transform Operators
transform operator applies transformFunc function to the generated RDD for a batch.
Note It asserts that one and exactly one RDD has been generated for a batch before calling the transformFunc .
transform Example
import org.apache.spark.rdd.RDD
val transformFunc: RDD[Int] => RDD[Int] = { inputRDD =>
println(s">>> inputRDD: $inputRDD")
inputRDD
}
clicks.transform(transformFunc).print
transformWith Operators
transformWith operators apply the transformFunc function to two generated RDD for a
batch.
Note It asserts that two and exactly two RDDs have been generated for a batch before calling the transformFunc .
transformWith Example
val ns = sc.parallelize(0 to 2)
import org.apache.spark.streaming.dstream.ConstantInputDStream
val nums = new ConstantInputDStream(ssc, ns)
// a second input dstream to pair with nums (assumed for this example)
val ws = sc.parallelize(Seq("zero", "one", "two"))
val words = new ConstantInputDStream(ssc, ws)
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
val transformFunc: (RDD[Int], RDD[String], Time) => RDD[(Int, String)] = { case (ns, ws, time) =>
println(s">>> ns: $ns")
println(s">>> ws: $ws")
println(s">>> batch: $time")
ns.zip(ws)
}
nums.transformWith(words, transformFunc).print
Windowed Operators
Go to Window Operations to read the official documentation.
Note This document aims at presenting the internals of windowed operators with examples.
In short, windowed operators allow you to apply transformations over a sliding window of
data, i.e. build a stateful computation across multiple batches.
By default, you apply transformations using different stream operators to a single RDD that
represents a dataset that has been built out of data received from one or many input
streams. The transformations know nothing about the past (datasets received and already
processed). The computations are hence stateless.
You can however build datasets based upon the past ones, and that is when windowed
operators enter the stage. Using them allows you to cross the boundary of a single dataset
(per batch) and have a series of datasets in your hands (as if the data they hold arrived in a
single batch interval).
slice Operators
slice operators return a collection of RDDs that were generated during the given time interval (both ends inclusive).
Both Time ends have to be a multiple of this stream’s slide duration. Otherwise, they are
aligned using Time.floor method.
When used, you should see the following INFO message in the logs:
window Operators
window operator creates a new stream that generates RDDs containing all the elements received during windowDuration .
Note windowDuration must be a multiple of the slide duration of the source stream.
messages.window(Seconds(10))
reduceByWindow Operator
reduceByWindow operator creates a new stream of single-element RDDs. Each element is
the result of applying reduceFunc to the data received during a batch, and then again to
the already-reduced elements of the past batches that fall within the window of
windowDuration , sliding slideDuration forward.
reduceByWindow Example
// batchDuration = Seconds(5)
// assuming a DStream[Int] called clicks (window of 10s, sliding every 5s)
val windowedClicks = clicks.reduceByWindow(_ + _, Seconds(10), Seconds(5))
windowedClicks.print
SaveAs Operators
There are two saveAs operators in DStream:
saveAsObjectFiles
saveAsTextFiles
They are output operators that return nothing as they save each RDD in a batch to a
storage.
RDD.saveAsTextFile.
The file name is based on mandatory prefix and batch time with optional suffix . It is in
the format of [prefix]-[time in milliseconds].[suffix] .
Example
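A minimal sketch, assuming a DStream[String] called clicks (the prefix and suffix are arbitrary):
// writes one directory per batch, named clicks-[time in ms].txt
clicks.saveAsTextFiles("clicks", "txt")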
Stateful Operators
Stateful operators allow you to work with state across batches, e.g. for cumulative calculations.
The motivation for the stateful operators is that by design streaming operators are stateless
and know nothing about the previous records and hence a state. If you’d like to react to new
records appropriately given the previous records you would have to resort to using persistent
storages outside Spark Streaming.
mapWithState Operator
You create StateSpec instances for mapWithState operator using the factory methods
StateSpec.function.
mapWithState Example
// checkpointing is mandatory
ssc.checkpoint("_checkpoints")
// assuming a DStream[(String, Int)] called clicks and a StateSpec built from a mapping function (see below)
val mappedStatefulStream = clicks.mapWithState(StateSpec.function(mappingFunc))
mappedStatefulStream.print()
initialState which is the initial state of the transformation, i.e. a paired RDD[(KeyType,
StateType)] .
timeout that sets the idle duration after which the state of an idle key will be removed.
A key and its state is considered idle if it has not received any data for at least the given
idle duration.
You create StateSpec instances using the factory methods StateSpec.function (that differ
in whether or not you want to access a batch time and return an optional mapped value):
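A sketch of building a StateSpec with the optional settings described above, assuming a spark-shell session; the mapping function and the initial RDD below are made up for illustration:
import org.apache.spark.streaming._

val initialRDD = sc.parallelize(Seq(("Apache Spark", 0), ("Scala", 0)))
// sums the values seen so far per key and keeps the sum as the state
val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum)
}
val spec = StateSpec.function(mappingFunc)
  .initialState(initialRDD)
  .timeout(Seconds(30))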
updateStateByKey Operator
1. When not specified explicitly, the partitioner used is HashPartitioner with the number of
partitions being the default level of parallelism of a Task Scheduler.
2. You may however specify the number of partitions explicitly for HashPartitioner to use.
3. This is the "canonical" updateStateByKey that the other two variants (without a partitioner or
the number of partitions) use. It allows specifying a partitioner explicitly. It then
executes the "last" updateStateByKey with rememberPartitioner enabled.
updateStateByKey stateful operator allows for maintaining per-key state and updating it
using updateFn . The updateFn is called for each key, and uses new data and existing state
of the key, to generate an updated state.
The state update function updateFn scans every key and generates a new state for every
key given a collection of values per key in a batch and the current state for the key (if exists).
Note The operator does not offer any timeout of idle data.
updateStateByKey Example
// checkpointing is mandatory
ssc.checkpoint("_checkpoints")
// helper functions
val inc = (n: Int) => n + 1
def buildState: Option[Int] = {
println(s">>> >>> Initial execution to build state or state is deliberately uninitialized yet"
println(s">>> >>> Building the state being the number of calls to update state function, i.e. the n
Some(1)
}
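A hedged sketch of how the helpers above could be wired into updateStateByKey, assuming a DStream[(String, Int)] called clicks:
val updateFn = (values: Seq[Int], state: Option[Int]) =>
  if (state.isEmpty) buildState else state.map(inc)
val statefulStream = clicks.updateStateByKey(updateFn)
statefulStream.print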
web UI and Streaming Statistics Page
The page is made up of three sections (aka tables) - the unnamed, top-level one with basic
information about the streaming application (right below the title Streaming Statistics),
Active Batches and Completed Batches.
Basic Information
Basic Information section is the top-level section in the Streaming page that offers basic
information about the streaming application.
It shows the number of all completed batches (for the entire period since the
StreamingContext was started) and received records (in parenthesis). This information
is later displayed in detail in the Active Batches and Completed Batches sections.
Below is the table for retained batches (i.e. waiting, running, and completed batches).
In Input Rate row, you can show and hide details of each input stream.
If there are input streams with receivers, the numbers of all the receivers and active ones
are displayed (as depicted in the Figure 2 above).
The average event rate for all registered streams is displayed (as Avg: [avg] events/sec).
Scheduling Delay
Scheduling Delay is the time spent from when the collection of streaming jobs for a batch
was submitted to when the first streaming job (out of possibly many streaming jobs in the
collection) was started.
The values in the timeline (the first column) depict the time between the events
Note StreamingListenerBatchSubmitted and StreamingListenerBatchStarted (with
minor yet additional delays to deliver the events).
You may see increase in scheduling delay in the timeline when streaming jobs are queued
up as in the following example:
Processing Time
Processing Time is the time spent to complete all the streaming jobs of a batch.
Total Delay
Total Delay is the time spent from submitting to complete all jobs of a batch.
Active Batches
Active Batches section presents waitingBatches and runningBatches together.
Completed Batches
Completed Batches section presents retained completed batches (using
completedBatchUIData ).
Figure 7. Two Batches with Incoming Data inside for Kafka Direct Stream in web UI
(Streaming tab)
Figure 8. Two Jobs for Kafka Direct Stream in web UI (Jobs tab)
Streaming Listeners
Streaming listeners are listeners interested in streaming events like batch submitted,
started or completed.
StreamingJobProgressListener
RateController
StreamingListenerEvent Events
StreamingListenerBatchSubmitted is posted when streaming jobs are submitted for execution.
StreamingListenerBatchCompleted is posted when a batch has completed, i.e. all the streaming jobs in the JobSet have stopped their execution.
StreamingJobProgressListener
StreamingJobProgressListener is a streaming listener that collects information for display in the Streaming tab of web UI.
onBatchSubmitted
For StreamingListenerBatchSubmitted(batchInfo: BatchInfo) events, it stores batchInfo
batch information in the internal waitingBatchUIData registry per batch time.
onBatchStarted
Caution FIXME
onBatchCompleted
Caution FIXME
Retained Batches
retainedBatches are waiting, running, and completed batches that web UI uses to display
streaming statistics.
Checkpointing
Checkpointing is a process of writing received records (by means of input dstreams) at
checkpoint intervals to a highly-available HDFS-compatible storage. It allows creating fault-
tolerant stream processing pipelines so when a failure occurs input dstreams can restore
the before-failure streaming state and continue stream processing (as if nothing had
happened).
ssc.checkpoint("_checkpoint")
Note You can also create a brand new StreamingContext (and put checkpoints aside).
Warning You must not create input dstreams using a StreamingContext that has been recreated from checkpoint. Otherwise, you will not start the StreamingContext at all.
When you use StreamingContext(path: String) constructor (or the variants thereof), it uses
Hadoop configuration to access path directory on a Hadoop-supported file system.
Note SparkContext and batch interval are set to their corresponding values using the checkpoint file.
// a sketch; checkpointDir, conf and batchDuration are assumed to be defined elsewhere
def createSC(): StreamingContext = {
  val ssc = new StreamingContext(conf, batchDuration)
  // set up input dstreams and transformations here
  ssc.checkpoint(checkpointDir)
  ssc
}
val ssc = StreamingContext.getOrCreate(checkpointDir, createSC)
DStreamCheckpointData
It tracks checkpoint data in the internal data registry that records batch time and the
checkpoint data at that time. The internal checkpoint data can be anything that a dstream
wants to checkpoint. DStreamCheckpointData returns the registry when
currentCheckpointFiles method is called.
Refer to Logging.
update collects batches and the directory names where the corresponding RDDs were checkpointed.
The collection of the batches and their checkpointed RDDs is recorded in an internal field for
serialization (i.e. it becomes the current value of the internal field currentCheckpointFiles
that is serialized when requested).
cleanup deletes checkpoint files older than the oldest batch for the input time .
It first gets the oldest batch time for the input time (see Updating Collection of Batches and
Checkpoint Directories (update method)).
If the (batch) time has been found, all the checkpoint files older are deleted (as tracked in
the internal timeToCheckpointFile mapping).
For each checkpoint file successfully deleted, you should see an INFO message in the logs.
If deleting a checkpoint file fails, you should see the following WARN message instead:
WARN Error deleting old checkpoint file '[file]' for time [time]
Otherwise, when no (batch) time has been found for the given input time , you should see
the following DEBUG message in the logs:
restore(): Unit
restore restores the dstream’s generatedRDDs given persistent internal data mapping
restore takes the current checkpoint files and restores checkpointed RDDs from each checkpoint file.
You should see the following INFO message in the logs per checkpoint file:
INFO Restoring checkpointed RDD for time [time] from file '[file]'
Checkpoint
Checkpoint class requires a StreamingContext and checkpointTime time to be instantiated.
It is merely a collection of the settings of the current streaming runtime environment that is
supposed to recreate the environment after it goes down due to a failure or when the
streaming context is stopped immediately.
It collects the settings from the input StreamingContext (and indirectly from the
corresponding JobScheduler and SparkContext):
Refer to Logging.
serialize writes the input checkpoint object (using a compression codec) and returns the result as a collection of bytes.
deserialize reads the bytes back (using a compression codec) and, once read, the just-built Checkpoint object is validated and returned
back.
Note deserialize is called when reading the latest valid checkpoint file.
validate(): Unit
validate validates the Checkpoint. It ensures that master , framework , graph , and checkpointTime are defined.
You should see the following INFO message in the logs when the object passes the
validation:
The method sorts the checkpoint files by time with a temporary .bk checkpoint file first
(given a pair of a checkpoint file and its backup file).
CheckpointWriter
An instance of CheckpointWriter is created (lazily) when JobGenerator is, but only when
JobGenerator is configured for checkpointing.
It uses the internal single-thread thread pool executor to execute checkpoint writes
asynchronously and does so until it is stopped.
write method serializes the checkpoint object and passes the serialized form to CheckpointWriteHandler to be written asynchronously.
Note It is called when JobGenerator receives DoCheckpoint event and the batch time is eligible for checkpointing.
If the asynchronous checkpoint write fails, you should see the following ERROR in the logs:
ERROR Could not submit checkpoint task to the thread pool executor
stop(): Unit
CheckpointWriter uses the internal stopped flag to mark whether it is stopped or not.
stop method checks the internal stopped flag and returns if it says it is stopped already.
If not, it orderly shuts down the internal single-thread thread pool executor and awaits
termination for 10 seconds. During that time, any asynchronous checkpoint writes can be
safely finished, but no new tasks will be accepted.
Note The wait time before executor stops is fixed, i.e. not configurable, and is set to 10 seconds.
After 10 seconds, when the thread pool did not terminate, stop stops it forcefully.
INFO CheckpointWriter: CheckpointWriter executor terminated? [terminated], waited for [time] ms.
It shuts down when CheckpointWriter is stopped (with a 10-second graceful period before it
terminated forcefully).
CheckpointWriteHandler — Asynchronous Checkpoint
Writes
CheckpointWriteHandler is an (internal) thread of execution that does checkpoint writes. It is
instantiated with checkpointTime , the serialized form of the checkpoint, and whether or not
to clean checkpoint data later flag (as clearCheckpointDataLater ).
Note It is only used by CheckpointWriter to queue a checkpoint write for a batch time.
It records the current checkpoint time (in latestCheckpointTime ) and calculates the name of
the checkpoint file.
It uses a backup file to do atomic write, i.e. it writes to the checkpoint backup file first and
renames the result file to the final checkpoint file name.
When attempting to write, you should see the following INFO message in the logs:
Note It deletes any checkpoint backup files that may exist from the previous attempts.
It then deletes checkpoint files when there are more than 10.
Note The number of checkpoint files when the deletion happens, i.e. 10, is fixed and not configurable.
If all went fine, you should see the following INFO message in the logs:
INFO CheckpointWriter: Checkpoint for time [checkpointTime] ms saved to file '[checkpointFile]', took
JobGenerator is informed that the checkpoint write completed (with checkpointTime and
clearCheckpointDataLater flag).
In case of write failures, you can see the following WARN message in the logs:
If the number of write attempts exceeded (the fixed) 10 or CheckpointWriter was stopped
before any successful checkpoint write, you should see the following WARN message in the
logs:
WARN CheckpointWriter: Could not write checkpoint for time [checkpointTime] to file [checkpointFile]'
CheckpointReader
CheckpointReader is a private[streaming] helper class to read the latest valid checkpoint file from a checkpoint directory.
read methods read the latest valid checkpoint file from the checkpoint directory
checkpointDir . They differ in whether Spark configuration conf and Hadoop configuration hadoopConf are given or have to be created.
The first read throws no SparkException when no checkpoint file could be read.
Note It appears that no part of Spark Streaming uses the simplified version of read .
read uses Apache Hadoop’s Path and Configuration to get the checkpoint files (using Checkpoint.getCheckpointFiles ).
The method reads all the checkpoints (from the youngest to the oldest) until one is
successfully loaded, i.e. deserialized.
You should see the following INFO message in the logs just before deserializing a
checkpoint file :
If the checkpoint file was loaded, you should see the following INFO messages in the logs:
In case of any issues while loading a checkpoint file, you should see the following WARN in
the logs and the corresponding exception:
JobScheduler
Streaming scheduler ( JobScheduler ) schedules streaming jobs to be run as Spark jobs. It
is created as part of creating a StreamingContext and starts with it.
Refer to Logging.
start(): Unit
When JobScheduler starts (i.e. when start is called), you should see the following
DEBUG message in the logs:
It then goes over all the dependent services and starts them one by one as depicted in the
figure.
It asks DStreamGraph for input dstreams and registers their RateControllers (if defined) as
streaming listeners. It starts StreamingListenerBus afterwards.
It starts JobGenerator.
Just before start finishes, you should see the following INFO message in the logs:
Caution FIXME
ReceiverTracker is stopped.
If the stop should wait for all received data to be processed (the input parameter
processAllReceivedData is true ), stop awaits termination of jobExecutor Thread Pool for
1 hour (it is assumed that it is enough and is not configurable). Otherwise, it waits for 2
seconds.
When no streaming jobs are inside the jobSet , you should see the following INFO in the
logs:
Otherwise, when there is at least one streaming job inside the jobSet ,
StreamingListenerBatchSubmitted (with data statistics of every registered input stream for
which the streaming jobs were generated) is posted to StreamingListenerBus.
It then goes over every streaming job in the jobSet and executes a JobHandler (on
jobExecutor Thread Pool).
At the end, you should see the following INFO message in the logs:
JobHandler
JobHandler is a thread of execution for a streaming job (that simply calls Job.run ).
When started, it prepares the environment (so the streaming job can be nicely displayed in
the web UI under /streaming/batch/?id=[milliseconds] ) and posts JobStarted event to
JobSchedulerEvent event loop.
It runs the streaming job that executes the job function as defined while generating a
streaming job for an output stream.
You may see similar-looking INFO messages in the logs (it depends on the operators you
use):
It is used to execute JobHandler for jobs in JobSet (see submitJobSet in this document).
handleJobStart(job: Job, startTime: Long) takes a JobSet (from jobSets ) and checks whether it is the first streaming job of the JobSet to start.
INFO JobScheduler: Starting job [job.id] from job set of time [jobSet.time] ms
handleJobCompletion looks the JobSet up (from the jobSets internal registry) and calls its handleJobCompletion .
INFO JobScheduler: Finished job [job.id] from job set of time [jobSet.time] ms
INFO JobScheduler: Total delay: [totalDelay] s for time [time] ms (execution: [processingDelay] s)
Internal Registries
JobScheduler maintains the following information in internal registries:
JobSet
A JobSet represents a collection of streaming jobs that were created at (batch) time for
output streams (that have ultimately produced a streaming job as they may opt out).
registry).
Note At the beginning (when JobSet is created) all streaming jobs are incomplete.
processingStartTime being the time when the first streaming job in the collection was
started.
processingEndTime being the time when the last streaming job in the collection finished
processing.
Processing delay is the time spent for processing all the streaming jobs in a JobSet
from the time the very first job was started, i.e. the time between started and completed
states.
Total delay is the time from the batch time until the JobSet was completed.
JobGenerator.generateJobs
JobScheduler.submitJobSet(jobSet: JobSet)
JobGenerator.restart
InputInfoTracker
InputInfoTracker tracks batch times and batch statistics for input streams (per input stream
id with StreamInputInfo ). It is later used when JobGenerator submits streaming jobs for a
batch time (and propagated to interested listeners as StreamingListenerBatchSubmitted
event).
batch times and input streams (i.e. another mapping between input stream ids and
StreamInputInfo ).
It accumulates batch statistics at every batch time when input streams are computing RDDs
(and explicitly call InputInfoTracker.reportInfo method).
Cleaning up
You should see the following INFO message when cleanup of old batch times is requested
(akin to garbage collection):
JobGenerator
JobGenerator asynchronously generates streaming jobs every batch interval (using
recurring timer) that may or may not be checkpointed afterwards. It also periodically
requests clearing up metadata and checkpoint data for each input dstream.
Refer to Logging.
start(): Unit
Figure 1. JobGenerator Start (First Time) procedure (tip: follow the numbers)
It first checks whether or not the internal event loop has already been created which is the
way to know that the JobScheduler was started. If so, it does nothing and exits.
startFirstTime(): Unit
It first requests timer for the start time and passes the start time along to
DStreamGraph.start and RecurringTimer.start.
The start time has the property of being a multiple of batch interval and after the
Note current system time. It is in the hands of recurring timer to calculate a time with
the property given a batch interval.
Note When recurring timer starts for JobGenerator , you should see the following INFO message in the logs:
INFO RecurringTimer: Started timer for JobGenerator at time [nextTime]
Right before the method finishes, you should see the following INFO message in the logs:
stop stops JobGenerator , either gracefully, i.e. after having processed all received data and pending streaming jobs, or immediately.
It first checks whether eventLoop internal event loop was ever started (through checking
null ).
When JobGenerator should stop immediately, i.e. ignoring unprocessed data and pending
streaming jobs ( processReceivedData flag is disabled), you should see the following INFO
message in the logs:
It requests the timer to stop forcefully ( interruptTimer is enabled) and stops the graph.
You should immediately see the following INFO message in the logs:
INFO JobGenerator: Waiting for all received blocks to be consumed for job generation
ReceiverTracker has any blocks left to be processed (whatever is shorter) before continuing.
When a timeout occurs, you should see the WARN message in the logs:
WARN JobGenerator: Timed out while stopping the job generator (timeout = [stopTimeoutMs])
After the waiting is over, you should see the following INFO message in the logs:
INFO JobGenerator: Waited for all received blocks to be consumed for job generation
It requests timer to stop generating streaming jobs ( interruptTimer flag is disabled) and
stops the graph.
You should immediately see the following INFO message in the logs:
batches have been processed (whatever is shorter) before continuing. It waits for batches to
complete using last processed batch internal property that should eventually be exactly the
time when the timer was stopped (it returns the last time for which the streaming job was
generated).
After the waiting is over, you should see the following INFO message in the logs:
As the last step, when JobGenerator is assumed to be stopped completely, you should see
the following INFO message in the logs:
restart(): Unit
restart recreates the runtime environment of the past execution that may have stopped immediately, i.e. without waiting
for all the streaming jobs to complete when checkpoint was enabled, or due to an abrupt
shutdown (an unrecoverable failure or similar).
restart first calculates the batches that may have been missed while JobGenerator was
down, i.e. batch times between the current restart time and the time of initial checkpoint.
restart doesn’t check whether the initial checkpoint exists or not that may
Warning
lead to NPE.
It then asks the initial checkpoint for pending batches, i.e. the times of streaming job sets.
Caution FIXME What are the pending batches? Why would they ever exist?
It then computes the batches to reschedule, i.e. pending and down time batches that are
before restart time.
The only purpose of the lastProcessedBatch property is to allow for stopping the streaming
context gracefully, i.e. to wait until all generated streaming jobs are completed.
For every JobGeneratorEvent event, you should see the following DEBUG message in the
logs:
GenerateJobs
DoCheckpoint
ClearMetadata
ClearCheckpointData
See below in the document for the extensive coverage of the supported JobGeneratorEvent
event types.
generateJobs(time: Time)
If the above two calls have finished successfully, InputInfoTracker is requested for data
statistics of every registered input stream for the given batch time that together with the
collection of streaming jobs (from DStreamGraph) is passed on to
JobScheduler.submitJobSet (as a JobSet).
If checkpointing is disabled or the current batch time is not eligible for checkpointing, the
method does nothing and exits.
Note A current batch is eligible for checkpointing when the time interval between current batch time and zero time is a multiple of checkpoint interval.
Caution FIXME Who checks, and when, whether checkpoint interval is greater than batch interval or not? What about checking whether a checkpoint interval is a multiple of batch time?
Otherwise, when checkpointing should be performed, you should see the following INFO
message in the logs:
It requests DStreamGraph for updating checkpoint data and CheckpointWriter for writing a
new checkpoint. Both are given the current batch time .
Note ClearMetadata are posted after a micro-batch for a batch time has completed.
It removes old RDDs that have been generated and collected so far by output streams
(managed by DStreamGraph). It is a sort of garbage collector.
When ClearMetadata(time) arrives, it first asks DStreamGraph to clear metadata for the
given time.
Eventually, it marks the batch as fully processed, i.e. that the batch completed as well as
checkpointing or metadata cleanups, using the internal lastProcessedBatch marker.
clearCheckpointData(time: Time)
It then asks DStreamGraph for the maximum remember interval. Given the maximum
remember interval JobGenerator requests ReceiverTracker to cleanup old blocks and
batches and InputInfoTracker to do cleanup for data accumulated before the maximum
remember interval (from time ).
Having done that, the current batch time is marked as fully processed.
shouldCheckpoint flag is enabled (i.e. true ) when checkpoint interval and checkpoint directory are defined.
Caution When and what for are they set? Can one of ssc.checkpointDuration and ssc.checkpointDir be null ? Do they all have to be set and is this checked somewhere? Answer: See Setup Validation.
onCheckpointCompletion
Caution FIXME
timer RecurringTimer
timer RecurringTimer (with the name JobGenerator ) is used to post GenerateJobs events to the internal event loop every batch interval.
Note timer is created when JobGenerator is. It starts when JobGenerator starts (for the first time only).
DStreamGraph
DStreamGraph (is a final helper class that) manages input and output dstreams. It also
holds zero time for the other components that marks the time when it was started.
It maintains input DStream instances (as inputStreams ) and output DStream instances (as outputStreams ), but, more importantly, it generates streaming
jobs for output streams for a batch (time).
DStreamGraph holds the batch interval for the other parts of a Streaming application.
Refer to Logging.
It is passed on down the output dstream graph so output dstreams can initialize themselves.
Streaming application.
It appears that it is the place for the value since it must be set before JobGenerator can be
instantiated.
getMaxInputStreamRememberDuration(): Duration
Maximum Remember Interval is the maximum remember interval across all the input
dstreams. It is calculated using getMaxInputStreamRememberDuration method.
Caution FIXME
you need to register a dstream (using DStream.register method) which happens for…FIXME
Starting DStreamGraph
When DStreamGraph is started (using start method), it sets zero time and start time.
start method is called when JobGenerator starts for the first time (not from a
Note
checkpoint).
Note You can start DStreamGraph as many times until time is not null and zero time has been set.
(output dstreams) start then walks over the collection of output dstreams and for each
output dstream, one at a time, calls their initialize(zeroTime), remember (with the current
remember interval), and validateAtStart methods.
(input dstreams) When all the output streams are processed, it starts the input dstreams (in
parallel) using start method.
Stopping DStreamGraph
stop(): Unit
Caution FIXME
Restarting DStreamGraph
Note This is the only moment when zero time can be different than start time.
generateJobs method generates a collection of streaming jobs for output streams for a
given batch time . It walks over each registered output stream (in outputStreams internal
registry) and requests each stream for a streaming job
When generateJobs method executes, you should see the following DEBUG message in
the logs:
generateJobs then walks over each registered output stream (in the outputStreams internal registry) and requests each for a streaming job.
Right before the method finishes, you should see the following DEBUG message with the
number of streaming jobs generated (as jobs.length ):
Validation Check
validate() method checks whether batch duration and at least one output stream have been set.
Metadata Cleanup
When clearMetadata(time: Time) is called, you should see the following DEBUG message
in the logs:
It merely walks over the collection of output streams and (synchronously, one by one) asks
each to do its own metadata cleaning.
When finishes, you should see the following DEBUG message in the logs:
restoreCheckpointData(): Unit
When restoreCheckpointData() is executed, you should see the following INFO message in
the logs:
At the end, you should see the following INFO message in the logs:
When updateCheckpointData is called, you should see the following INFO message in the
logs:
It then walks over every output dstream and calls its updateCheckpointData(time).
When updateCheckpointData finishes it prints out the following INFO message to the logs:
Checkpoint Cleanup
clearCheckpointData(time: Time)
When clearCheckpointData is called, you should see the following INFO message in the
logs:
It merely walks through the collection of output streams and (synchronously, one by one)
asks each to do its own checkpoint data cleaning.
When finished, you should see the following INFO message in the logs:
Remember Interval
Remember interval is the time to remember (aka cache) the RDDs that have been
generated by (output) dstreams in the context (before they are released and garbage
collected).
remember method
It first checks whether or not it has been set already and if so, throws
java.lang.IllegalArgumentException as follows:
DStreams
There is no notion of input and output dstreams. DStreams are all instances of DStream
abstract class (see DStream Contract in this document). You may however correctly assume
that all dstreams are input. And it happens to be so until you register a dstream that marks it
as output.
Refer to Logging.
DStream Contract
A DStream is defined by the following properties (with the names of the corresponding
methods that subclasses have to implement):
dstream dependencies, i.e. a collection of DStreams that this DStream depends on.
They are often referred to as parent dstreams.
slide duration (aka slide interval), i.e. a time interval after which the stream is
requested to generate a RDD out of input data it consumes.
How to compute (generate) an optional RDD for the given batch if any. validTime is a
point in time that marks the end boundary of slide duration.
Creating DStreams
You can create dstreams through the built-in input stream constructors using streaming
context or more specialized add-ons for external input data sources, e.g. Apache Kafka.
It serves as the initialization marker (via isInitialized method) and helps calculating
intervals for RDD checkpointing (when checkpoint interval is set and the current batch time
is a multiple thereof), slicing, and the time validation for a batch (when a dstream generates
a RDD).
Initially, when a dstream is created, the remember interval is not set (i.e. null ), but is set
when the dstream is initialized.
Note You may see the current value of the remember interval when a dstream is validated at startup and the log level is INFO.
The internal generatedRDDs registry holds the RDDs that were generated for the batches. It acts as a cache when a dstream is requested to
compute a RDD for a batch (i.e. generatedRDDs may already have the RDD or gets a new
RDD added).
As new RDDs are added, dstreams offer a way to clear the old metadata during which the
old RDDs are removed from the generatedRDDs collection.
initialize method sets zero time and optionally checkpoint interval (if the dstream must
checkpoint and the interval was not set already) and remember duration.
The zero time of a dstream can only be set once or be set again to the same zero time.
Otherwise, it throws SparkException as follows:
If mustCheckpoint is enabled and the checkpoint interval was not set, it is automatically set
to the slide interval or 10 seconds, whichever is longer. You should see the following INFO
message in the logs when the checkpoint interval was set automatically:
It then ensures that remember interval is at least twice the checkpoint interval (only if
defined) or the slide duration.
At the very end, it initializes the parent dstreams (available as dependencies) that
recursively initializes the entire graph of dstreams.
remember Method
remember sets remember interval for the current dstream and the dstreams it depends on
(see dependencies).
If the input duration is specified (i.e. not null ), remember allows setting the remember
interval (only when the current value was not set already) or extend it (when the current
value is shorter).
You should see the following INFO message in the logs when the remember interval
changes:
At the end, remember always sets the current remember interval (whether it was set,
extended or did not change).
You can only enable checkpointing and set the checkpoint interval before StreamingContext
is started or UnsupportedOperationException is thrown as follows:
Internally, checkpoint method calls persist (that sets the default MEMORY_ONLY_SER storage
level).
If the checkpoint interval is set, the checkpoint directory is mandatory. Spark validates it when
StreamingContext starts and throws an IllegalArgumentException if it is not set.
java.lang.IllegalArgumentException: requirement failed: The checkpoint directory has not been set. Pl
You can see the value of the checkpoint interval for a dstream in the logs when it is
validated:
Checkpointing
DStreams can checkpoint input data at specified time intervals.
The following settings are internal to a dstream and define how it checkpoints the input data
if any.
Refer to Initializing DStreams (initialize method) to learn how it is used to set the
checkpoint interval, i.e. checkpointDuration .
checkpointDuration - the time interval at which a dstream checkpoints data. It is often called the checkpoint interval. If not set explicitly, but the dstream is checkpointed, it is set while initializing dstreams.
restoredFromCheckpointData - the state of a dstream, i.e. whether (true) or not (false) it was started by restoring state
from checkpoint.
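For illustration, a minimal way of wiring checkpointing up from user code (the directory name and intervals are arbitrary):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
// the checkpoint directory is mandatory once any dstream is checkpointed
ssc.checkpoint("_checkpoint")

val lines = ssc.socketTextStream("localhost", 9999)
// sets checkpointDuration (the checkpoint interval) for this dstream explicitly
lines.checkpoint(Seconds(30))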
register(): DStream[T]
DStream comes with internal register method that registers a DStream as an output
stream.
The internal private foreachRDD method uses register to register output streams to
DStreamGraph. Whenever called, it creates ForEachDStream and calls register upon it.
That is how streams become output streams.
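For illustration (assuming a lines dstream like the one created earlier), both operators below end up with registered output streams - print does the registration internally, while foreachRDD goes through ForEachDStream:

// print is a stream operator that registers an output stream as part of its operation
lines.print()

// foreachRDD creates a ForEachDStream and registers it as an output stream
lines.foreachRDD { rdd =>
  println(s"Records in this batch: ${rdd.count()}")
}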
The internal generateJob method generates a streaming job for a batch time for a (output)
dstream. It may or may not generate a streaming job for the requested batch time .
It computes an RDD for the batch and, if there is one, returns a streaming job for the batch
time with a job function that will run a Spark job over the generated RDD.
Note The Spark job uses an empty function to calculate partitions of a RDD.
It uses generatedRDDs to return the RDD if it has already been generated for the time . If
not, it generates one by computing the input stream (using compute(validTime: Time)
method).
If there was anything to process in the input stream, i.e. computing the input stream returned
a RDD, the RDD is first persisted (only if storageLevel for the input stream is different from
StorageLevel.NONE ).
The generated RDD is checkpointed if checkpointDuration is defined and the time interval
between current and zero times is a multiple of checkpointDuration.
Checkpoint Cleanup
Caution FIXME
restoreCheckpointData
restoreCheckpointData(): Unit
Once completed, the internal restoredFromCheckpointData flag is enabled (i.e. true ) and
you should see the following INFO message in the logs:
Metadata Cleanup
Note It is called when DStreamGraph clears metadata for every output stream.
clearMetadata(time: Time) is called to remove old RDDs that have been generated so far.
It collects generated RDDs (from generatedRDDs) that are older than rememberDuration.
Regardless of spark.streaming.unpersist flag, all the collected RDDs are removed from
generatedRDDs.
When spark.streaming.unpersist flag is set (it is by default), you should see the following
DEBUG message in the logs:
For every RDD in the list, it unpersists them (without blocking) one by one and explicitly
removes blocks for BlockRDDs. You should see the following INFO message in the logs:
After RDDs have been removed from generatedRDDs (and perhaps unpersisted), you
should see the following DEBUG message in the logs:
DEBUG Cleared [size] RDDs that were older than [time]: [time1, time2, ...]
updateCheckpointData
When updateCheckpointData is called, you should see the following DEBUG message in the
logs:
When updateCheckpointData finishes, you should see the following DEBUG message in the
logs:
Internal Registries
DStream implementations maintain the following internal properties:
Input DStreams
Input DStreams in Spark Streaming are the way to ingest data from external data sources.
They are represented as InputDStream abstract class.
InputDStream is the abstract base class for all input DStreams. It provides two abstract
methods start() and stop() to start and stop ingesting data, respectively.
Tip Name your custom InputDStream using the CamelCase notation with the suffix InputDStream, e.g. MyCustomInputDStream.
Custom implementations of InputDStream can override (and actually provide!) the optional
RateController. It is undefined by default.
package pl.japila.spark.streaming
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{ Time, StreamingContext }
import org.apache.spark.streaming.dstream.InputDStream
import scala.reflect.ClassTag
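The class itself might look as follows - a minimal, do-nothing sketch (the CustomInputDStream name and its empty behaviour are illustrative only):

class CustomInputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {

  // called once when the stream starts ingesting data
  override def start(): Unit = ()

  // called once when the stream stops ingesting data
  override def stop(): Unit = ()

  // no input data in this sketch, so no RDD for any batch
  override def compute(validTime: Time): Option[RDD[T]] = None
}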
Note Receiver input streams run receivers as long-running tasks that occupy a core per stream.
ReceiverInputDStream abstract class defines the following abstract method that custom implementations have to provide - getReceiver() that returns the receiver to run.
The receiver is then sent to and run on workers (when ReceiverTracker is started).
A rate controller is used only when spark.streaming.backpressure.enabled is enabled.
If the time to generate RDDs ( validTime ) is earlier than the start time of StreamingContext,
an empty BlockRDD is generated.
Otherwise, ReceiverTracker is requested for all the blocks that have been allocated to this
stream for this batch (using ReceiverTracker.getBlocksOfBatch ).
The number of records received for the batch for the input stream (as StreamInputInfo aka
input blocks information) is registered to InputInfoTracker (using
InputInfoTracker.reportInfo ).
Back Pressure
Caution FIXME
Back pressure for input dstreams with receivers can be configured using
spark.streaming.backpressure.enabled setting.
ConstantInputDStreams
ConstantInputDStream is an input stream that always returns the same mandatory input RDD at every batch time.
Example
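A minimal example (the RDD contents are arbitrary):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

val ssc = new StreamingContext(sc, Seconds(5))
val rdd = sc.parallelize(0 to 9)

// every batch returns the very same RDD
val constant = new ConstantInputDStream(ssc, rdd)
constant.print()
ssc.start()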
ForEachDStreams
ForEachDStream is an internal DStream with dependency on the parent stream with the
When generateJob is called, it returns a streaming job for a batch when parent stream
does. And if so, it uses the "foreach" function (given as foreachFunc ) to work on the RDDs
generated.
Note Although it may seem that ForEachDStreams are by design output streams, they are not. You have to use DStreamGraph.addOutputStream to register a stream as output. You use stream operators that do the registration as part of their operation, like print.
WindowedDStreams
WindowedDStream (aka windowed stream) is an internal DStream with dependency on the
parent stream.
Obviously, slide duration of the stream is given explicitly (and must be a multiple of the
parent’s slide duration).
The window duration is given explicitly as well.
To compute a RDD for a batch, it uses the slice operator on the parent stream (over the window [now - windowDuration + parent.slideDuration, now]) and builds a PartitionerAwareUnionRDD or a UnionRDD depending on the number of the partitioners defined by the RDDs in the window.
Otherwise, when there are multiple different partitioners in use, UnionRDD is created and
you should see the following DEBUG message in the logs:
log4j.logger.org.apache.spark.streaming.dstream.WindowedDStream=DEBUG
MapWithStateDStream
MapWithStateDStream is the result of mapWithState stateful operator.
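For reference, a typical (illustrative) use of mapWithState that keeps a running sum per key - the pairs dstream of (String, Int) records is an assumption:

import org.apache.spark.streaming.{State, StateSpec}

// given the key, the new value (if any) and the mutable state, emit a mapped record
val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum)
}

val mapped = pairs.mapWithState(StateSpec.function(mappingFunc))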
MapWithStateDStreamImpl
MapWithStateDStreamImpl is an internal DStream with dependency on the parent
dataStream key-value dstream. It uses a custom internal dstream called internalStream (of
type InternalMapWithStateDStream).
The compute method may or may not return a RDD[MappedType] by getOrCompute on the
internal stream and…TK
Caution FIXME
InternalMapWithStateDStream
InternalMapWithStateDStream is an internal dstream to support MapWithStateDStreamImpl
and uses dataStream (as parent of type DStream[(K, V)] ) as well as StateSpecImpl[K, V,
S, E] (as spec ).
It is a DStream[MapWithStateRDDRecord[K, S, E]] .
When initialized, if checkpoint interval is not set, it sets it as ten times longer than the slide
duration of the parent stream (the multiplier is not configurable and always 10 ).
If the RDD is found, it is returned as-is, provided the partitioners of the RDD and the stream are
equal. Otherwise, when the partitioners differ, the RDD is "repartitioned" using
MapWithStateRDD.createFromRDD .
StateDStream
StateDStream is the specialized DStream that is the result of updateStateByKey stateful
operator. It is a wrapper around a parent key-value pair dstream to build stateful pipeline
(by means of updateStateByKey operator) and as a stateful dstream enables checkpointing
(and hence requires some additional setup).
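A minimal (illustrative) updateStateByKey pipeline - the pairs dstream of (String, Int) counts is an assumption, and checkpointing is expected to be enabled on the StreamingContext:

// given a batch of new counts and the state so far, produce the new state
val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))

val stateful = pairs.updateStateByKey[Int](updateFunc)
stateful.print()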
It uses a parent key-value pair dstream, updateFunc update state function, a partitioner ,
a flag whether or not to preservePartitioning and an optional key-value pair initialRDD .
The only dependency of StateDStream is the input parent key-value pair dstream.
It forces checkpointing regardless of the current dstream configuration, i.e. the internal
mustCheckpoint is enabled.
When requested to compute a RDD it first attempts to get the state RDD for the previous
batch (using DStream.getOrCompute). If there is one, parent stream is requested for a
RDD for the current batch (using DStream.getOrCompute). If parent has computed one,
computeUsingPreviousRDD(parentRDD, prevStateRDD) is called.
Caution FIXME When could getOrCompute not return an RDD? How does this apply to the StateDStream? What about the parent's getOrCompute ?
If however parent has not generated a RDD for the current batch but the state RDD
existed, updateFn is called for every key of the state RDD to generate a new state per
partition (using RDD.mapPartitions)
Otherwise, when no state RDD exists, parent stream is requested for a RDD for the
current batch (using DStream.getOrCompute) and when no RDD was generated for the
batch, no computation is triggered.
Note When the stream processing starts, i.e. no state RDD exists, and there is no input data received, no computation is triggered.
Given no state RDD and with parent RDD computed, when initialRDD is NONE , the input
data batch (as parent RDD) is grouped by key (using groupByKey with partitioner ) and
then the update state function updateFunc is applied to the partitioned input data (using
mapPartitions) with None state. Otherwise, computeUsingPreviousRDD(parentRDD,
initialStateRDD) is called.
It should be read as given a collection of triples of a key, new records for the key, and the
current state for the key, generate a collection of keys and their state.
computeUsingPreviousRDD
The computeUsingPreviousRDD method uses cogroup and mapPartitions to build the final
state RDD.
Note Regardless of the return type Option[RDD[(K, S)]] that really allows no state, it will always return some state.
Note It is acceptable to end up with keys that have no new records per batch, but these keys do have a state (since they were received previously when no state might have been built yet).
The signature of cogroup is as follows and applies to key-value pair RDDs, i.e. RDD[(K, V)]
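For reference, the variant of cogroup from PairRDDFunctions that matters here is (quoting the Spark API):

def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]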
It defines an internal update function finalFunc that maps over the collection of all the
keys, new records per key, and at-most-one-element state per key to build new iterator that
ensures that:
1. a state per key exists (it is None or the state built so far)
With every triple per every key, the internal update function calls the constructor’s
updateFunc.
The state RDD is a cogrouped RDD (on parentRDD and prevStateRDD using the
constructor’s partitioner ) with every element per partition mapped over using the internal
update function finalFunc and the constructor’s preservePartitioning (through
mapPartitions ).
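A simplified sketch of that cogroup-plus-mapPartitions pattern (not Spark's actual code - the method name and the shape of updateFunc are assumptions made for illustration):

import scala.reflect.ClassTag
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

def updateViaCogroup[K: ClassTag, V: ClassTag, S: ClassTag](
    parentRDD: RDD[(K, V)],
    prevStateRDD: RDD[(K, S)],
    partitioner: Partitioner,
    updateFunc: (Seq[V], Option[S]) => Option[S]): RDD[(K, S)] = {
  // maps every (key, new records, at-most-one previous state) triple to the new state
  val finalFunc = (iter: Iterator[(K, (Iterable[V], Iterable[S]))]) =>
    iter.flatMap { case (key, (values, states)) =>
      updateFunc(values.toSeq, states.headOption).map(newState => (key, newState))
    }
  parentRDD.cogroup(prevStateRDD, partitioner)
    .mapPartitions(finalFunc, preservesPartitioning = true)
}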
TransformedDStream
TransformedDStream is the specialized DStream that is the result of transform operator.
Note When created, it asserts that the input collection of dstreams use the same StreamingContext and slide interval.
The slide interval is exactly the same as that in the first dstream in parents .
When requested to compute a RDD, it goes over every dstream in parents and asks to
getOrCompute a RDD.
Note It may throw a SparkException when a dstream does not compute a RDD for a batch.
Receivers
Receivers run on workers to receive external data. They are created and belong to
ReceiverInputDStreams.
The abstract Receiver class requires the following methods to be implemented - onStart() and onStop() (see
Custom Receiver):
A receiver uses store methods to store received data as data blocks into Spark’s memory.
A receiver can be in one of the three states: Initialized , Started , and Stopped .
Custom Receiver
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// a minimal custom receiver that merely logs its lifecycle callbacks
class CustomReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  def onStart() = {
    println("onStart called")
  }
  def onStop() = {
    println("onStop called")
  }
}

val ssc = new StreamingContext(sc, Seconds(5))
// register the receiver as an input stream (print makes it an output stream)
ssc.receiverStream(new CustomReceiver).print
ssc.start
ssc.stop()
ReceiverTracker
Introduction
ReceiverTracker manages execution of all Receivers.
It can only be started once and only when at least one input receiver has been registered.
ReceiverTracker can be in one of the following states:
Initialized
Started
Stopping
Stopped
Note You can only start ReceiverTracker once and multiple attempts lead to throwing SparkException exception.
It then launches receivers, i.e. it collects receivers for all registered ReceiverDStream and
posts them as StartAllReceivers to ReceiverTracker RPC endpoint.
In the meantime, receivers have their ids assigned that correspond to the unique identifier of
their ReceiverDStream .
A successful startup of ReceiverTracker finishes with the following INFO message in the
logs:
Caution FIXME
hasUnallocatedBlocks
Caution FIXME
StartAllReceivers
StartAllReceivers(receivers) is a local message sent by ReceiverTracker when it starts
(using ReceiverTracker.launchReceivers() ).
ReceiverTrackerEndpoint.startReceiver
Caution FIXME When the scaladoc says "along with the scheduled executors", does it mean that the executors are already started and waiting for the receiver?!
Namely, the internal startReceiverFunc function checks that the task attempt is 0 .
It then starts a ReceiverSupervisor for the receiver and keeps awaiting termination, i.e. once
the task is run, it keeps running until a termination message comes from some other external
source. The task is a long-running task for the receiver .
Otherwise, it distributes the one-element collection across the nodes (and potentially even
executors) for receiver . The RDD has the name Receiver [receiverId] .
The Spark job’s description is set to Streaming job running receiver [receiverId] .
Note The method demonstrates how you could use Spark Core as the distributed computation platform to launch any process on clusters and let Spark handle the distribution.
When there was a failure submitting the job, you should also see the ERROR message in
the logs:
Ultimately, right before the method exits, the following INFO message appears in the logs:
StopAllReceivers
Caution FIXME
AllReceiverIds
Caution FIXME
It then sends the stop signal to all the receivers (i.e. posts StopAllReceivers to
ReceiverTracker RPC endpoint) and waits 10 seconds for all the receivers to quit gracefully
(unless graceful flag is set).
Note The 10-second wait time for graceful quit is not configurable.
You should see the following INFO messages if the graceful flag is enabled which means
that the receivers quit in a graceful manner:
It then checks whether all the receivers have been deregistered or not by posting
AllReceiverIds to ReceiverTracker RPC endpoint.
You should see the following INFO message in the logs if they have:
Otherwise, when there were receivers not having been deregistered properly, the following
WARN message appears in the logs:
Note When there are no receiver input streams in use, the method does nothing.
ReceivedBlockTracker
Caution FIXME
You should see the following INFO message in the logs when cleanupOldBatches is called:
allocateBlocksToBatch Method
If the batch time is newer than the last allocated batch time, it grabs all unallocated blocks per stream (using getReceivedBlockQueue method) and
creates a map of stream ids and sequences of their ReceivedBlockInfo . It then writes the
received blocks to the write-ahead log (WAL) (using writeToLog method).
allocateBlocksToBatch stores the allocated blocks with the current batch time in the internal timeToAllocatedBlocks registry.
If there has been an error while writing to WAL or the batch time is older than
lastAllocatedBatchTime , you should see the following INFO message in the logs:
INFO Possibly processed batch [batchTime] needs to be processed again in WAL recovery
ReceiverSupervisors
ReceiverSupervisor is an (abstract) handler object that is responsible for supervising a
receiver (that runs on the worker). It assumes that implementations offer concrete methods
to push received data to Spark.
ReceiverSupervisor Contract
ReceiverSupervisor is a private[streaming] abstract class that assumes that concrete implementations provide the following methods:
pushBytes
pushIterator
pushArrayBuffer
createBlockGenerator
reportError
onReceiverStart
Starting Receivers
startReceiver() calls the (abstract) onReceiverStart() . When it returns true (it is unknown at this point when it could be false), the receiver enters the Started state and you should see the following INFO message in the logs:
The receiver’s onStart() is called and another INFO message appears in the logs:
Stopping Receivers
stop method is called with a message and an optional cause of the stop (called error ). It
calls stopReceiver method that prints the INFO message and checks the state of the
receiver to react appropriately.
When the receiver is in Started state, stopReceiver calls Receiver.onStop() , prints the
following INFO message, and onReceiverStop(message, error) .
Restarting Receivers
A ReceiverSupervisor uses spark.streaming.receiverRestartDelay to restart the receiver
with delay.
It then stops the receiver, sleeps for delay milliseconds and starts the receiver (using
startReceiver() ).
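For example (the 10-second value is arbitrary), the delay can be set through SparkConf before the StreamingContext is created:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ReceiverRestartDemo")
  // restart a failed receiver after 10 seconds (the value is in milliseconds)
  .set("spark.streaming.receiverRestartDelay", "10000")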
Awaiting Termination
awaitTermination method blocks the current thread to wait for the receiver to be stopped.
When called, you should see the following INFO message in the logs:
If a receiver has terminated successfully, you should see the following INFO message in the
logs:
stoppingError is the exception associated with the stopping of the receiver and is rethrown in the following cases:
ReceiverSupervisorImpl
ReceiverSupervisorImpl is the implementation of ReceiverSupervisor contract.
It communicates with ReceiverTracker that runs on the driver (by posting messages using
the ReceiverTracker RPC endpoint).
log4j.logger.org.apache.spark.streaming.receiver.ReceiverSupervisorImpl=DEBUG
push Methods
push methods, i.e. pushArrayBuffer , pushIterator , and pushBytes solely pass calls on to
ReceiverSupervisorImpl.pushAndReportBlock.
ReceiverSupervisorImpl.onReceiverStart
ReceiverSupervisorImpl.onReceiverStart sends a blocking RegisterReceiver message to
(using getCurrentLimit ).
ReceivedBlockHandler
ReceiverSupervisorImpl uses the spark.streaming.receiver.writeAheadLog.enable setting to decide which ReceivedBlockHandler to use.
ReceiverSupervisorImpl.pushAndReportBlock
ReceiverSupervisorImpl.pushAndReportBlock(receivedBlock: ReceivedBlock, metadataOption: Option[Any], blockIdOption: Option[StreamBlockId]) stores the received block (using the ReceivedBlockHandler) and reports it to the ReceiverTracker RPC endpoint (on the
driver).
When a response comes, you should see the following DEBUG message in the logs:
ReceivedBlockHandlers
ReceivedBlockHandler represents how to handle the storage of blocks received by receivers.
ReceivedBlockHandler Contract
ReceivedBlockHandler is a private[streaming] trait . It comes with two methods - storeBlock and cleanupOldBlocks.
Note cleanupOldBlocks implies that there is a relation between blocks and the time they arrived.
BlockManagerBasedBlockHandler
BlockManagerBasedBlockHandler is the default ReceivedBlockHandler in Spark Streaming.
cleanupOldBlocks is not used as blocks are cleared by some other means (FIXME).
It uses BlockManager to store a ReceivedBlock .
WriteAheadLogBasedBlockHandler
WriteAheadLogBasedBlockHandler is used when
spark.streaming.receiver.writeAheadLog.enable is true .
Using receivers
With no receivers
There is yet another "middle-ground" approach (so-called unofficial since it is not available
by default in Spark Streaming):
…
Streaming mode
You create DirectKafkaInputDStream using KafkaUtils.createDirectStream .
Note Kafka brokers have to be up and running before you can create a direct stream.
// Enable checkpointing
ssc.checkpoint("_checkpoint")
// You may or may not want to enable some additional DEBUG logging
import org.apache.log4j._
Logger.getLogger("org.apache.spark.streaming.dstream.DStream").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.streaming.dstream.WindowedDStream").setLevel(Level.DEBUG
Logger.getLogger("org.apache.spark.streaming.DStreamGraph").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.streaming.scheduler.JobGenerator").setLevel(Level.DEBUG
// Connect to Kafka
import org.apache.spark.streaming.kafka.KafkaUtils
import _root_.kafka.serializer.StringDecoder
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val kafkaTopics = Set("spark-topic")
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, kafkaTopics)
If zookeeper.connect or group.id parameters are not set, they are added with their values
being empty strings.
In this mode, you will only see jobs submitted (in the Jobs tab in web UI) when a message
comes in.
DirectKafkaInputDStream
DirectKafkaInputDStream is an input stream of KafkaRDD batches.
As an input stream, it implements the five mandatory abstract methods - three from
DStream and two from InputDStream .
It has no dependencies on other streams (other than the Kafka brokers to read data from).
The name of the input stream is Kafka direct stream [id]. You can find the name in the
Streaming tab in web UI (in the details of a batch in Input Metadata section).
Every time the method is called, latestLeaderOffsets calculates the latest offsets (as
Map[TopicAndPartition, LeaderOffset] ).
Note Every call to compute does call Kafka brokers for the offsets.
The moving parts of generated KafkaRDD instances are offsets. Others are taken directly
from DirectKafkaInputDStream (given at the time of instantiation).
It sets the just-calculated offsets as current (using currentOffsets ) and returns a new
KafkaRDD instance.
Back Pressure
Caution FIXME
Back pressure for Direct Kafka input dstream can be configured using
spark.streaming.backpressure.enabled setting.
Kafka Concepts
broker
leader
topic
partition
offset
exactly-once semantics
LeaderOffset
LeaderOffset is an internal class to represent an offset on the topic partition on the broker.
Recommended Reading
Exactly-once Spark Streaming from Apache Kafka
KafkaRDD
KafkaRDD class represents a RDD dataset from Apache Kafka. It uses KafkaRDDPartition
for partitions that know their preferred locations as the host of the topic (not port however!). It
then nicely maps a RDD partition to a Kafka partition.
KafkaRDD overrides methods of RDD class to base them on offsetRanges , i.e. partitions.
Computing Partitions
To compute a partition, KafkaRDD checks the validity of beginning and ending offsets (so
they range over at least one element) and returns an (internal) KafkaRDDIterator .
INFO KafkaRDD: Computing topic [topic], partition [partition] offsets [fromOffset] -> [toOffset]
It fetches batches of kc.config.fetchMessageMaxBytes size per topic, partition, and offset (it
uses kafka.consumer.SimpleConsumer.fetch(kafka.api.FetchRequest) method).
RecurringTimer
class RecurringTimer(clock: Clock, period: Long, callback: (Long) => Unit, name: String)
RecurringTimer uses a single daemon thread prefixed RecurringTimer - [name] that, once started, executes callback in a loop
every period time (until it is stopped).
Refer to Logging.
When RecurringTimer triggers an action for a period , you should see the following
DEBUG message in the logs:
getStartTime(): Long
getRestartTime(originalStartTime: Long): Long
getStartTime calculates a time that is a multiple of the timer's period and is right after the
current system time. getRestartTime is similar but takes the originalStartTime input
parameter, i.e. it calculates a time as getStartTime but shifts the result to accommodate the
time gap since originalStartTime .
Starting Timer
When start is called, it sets the internal nextTime to the given input parameter
startTime and starts the internal daemon thread. This is the moment when the clock starts
ticking…
Stopping Timer
When called, you should see the following INFO message in the logs:
stop method uses the internal stopped flag to mark the stopped state and returns the time of the last period.
Note Before it fully terminates, it triggers callback one more/last time, i.e. callback is executed for a period after RecurringTimer has been (marked) stopped.
Fun Fact
You can execute org.apache.spark.streaming.util.RecurringTimer as a command-line
standalone application.
$ ./bin/spark-class org.apache.spark.streaming.util.RecurringTimer
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
INFO RecurringTimer: Started timer for Test at time 1453787444000
INFO RecurringTimer: 1453787444000: 1453787444000
DEBUG RecurringTimer: Callback for Test called at time 1453787444000
INFO RecurringTimer: 1453787445005: 1005
DEBUG RecurringTimer: Callback for Test called at time 1453787445000
INFO RecurringTimer: 1453787446004: 999
DEBUG RecurringTimer: Callback for Test called at time 1453787446000
INFO RecurringTimer: 1453787447005: 1001
DEBUG RecurringTimer: Callback for Test called at time 1453787447000
INFO RecurringTimer: 1453787448000: 995
DEBUG RecurringTimer: Callback for Test called at time 1453787448000
^C
INFO ShutdownHookManager: Shutdown hook called
INFO ShutdownHookManager: Deleting directory /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T
INFO ShutdownHookManager: Deleting directory /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T
With backpressure you can guarantee that your Spark Streaming application is stable, i.e.
receives data only as fast as it can process it.
You can monitor a streaming application using web UI. It is important to ensure that the
batch processing time is shorter than the batch interval. Backpressure introduces a
feedback loop so the streaming system can adapt to longer processing times and avoid
instability.
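Backpressure is disabled by default; a hedged example of turning it on through SparkConf:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("BackpressureDemo")
  // let the streaming engine adapt the ingestion rate to the observed processing rate
  .set("spark.streaming.backpressure.enabled", "true")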
RateController
A RateController is a StreamingListener that listens to batch
completed updates for a dstream and maintains a rate limit, i.e. an estimate of the speed at
which this stream should ingest messages. With every batch completed update event it
calculates the current processing rate and estimates the correct receiving rate.
The internal computeAndPublish method computes a new rate limit and publishes the current value. The computed value is set as the present rate
limit, and published (using the sole abstract publish method).
Caution FIXME Where is this used? What are the use cases?
RateEstimator
RateEstimator computes the rate given the input time , elements , processingDelay , and
schedulingDelay .
def compute(
time: Long,
elements: Long,
processingDelay: Long,
schedulingDelay: Long): Option[Double]
Warning The PID rate estimator is the only possible estimator. All other rate estimators lead to IllegalArgumentException being thrown.
Refer to Logging.
When the PID rate estimator is created you should see the following INFO message in the
logs:
When the PID rate estimator computes the rate limit for the current time, you should see the
following TRACE message in the logs:
TRACE PIDRateEstimator:
time = [time], # records = [numElements], processing time = [processingDelay], scheduling delay = [sc
If the time to compute the current rate limit for is before the latest time or the number of
records is 0 or less, or processing delay is 0 or less, the rate estimation is skipped. You
should see the following TRACE message in the logs:
Otherwise, when the rate estimation is for a later time, some records have been
processed, and the processing delay is positive, it computes the rate estimate.
Once the new rate has already been computed, you should see the following TRACE
message in the logs:
TRACE PIDRateEstimator:
latestRate = [latestRate], error = [error]
latestError = [latestError], historicalError = [historicalError]
delaySinceUpdate = [delaySinceUpdate], dError = [dError]
If it was the first computation of the limit rate, you should see the following TRACE message
in the logs:
Otherwise, when it is another limit rate, you should see the following TRACE message in the
logs:
The motivation is to control the number of executors required to process input records when
their number increases to the point when the processing time could become longer than the
batch interval.
Configuration
spark.streaming.dynamicAllocation.enabled controls whether to enable dynamic allocation for Spark Streaming.
Settings
The following is a list of the settings used to configure Spark Streaming applications.
spark.streaming.manualClock.jump - adds its value to checkpoint time, when used with the clock being a subclass of
org.apache.spark.util.ManualClock . It is used when JobGenerator is restarted from
checkpoint.
Checkpointing
spark.streaming.checkpoint.directory - when set and StreamingContext is created, the value gets passed on to the StreamingContext.checkpoint method.
Back Pressure
spark.streaming.backpressure.enabled (default: false ) - enables ( true ) or disables ( false ) back pressure.
Spark SQL
From Spark SQL home page:
Spark SQL is Spark’s module for working with structured data (rows and columns) in
Spark.
From Spark’s Role in the Big Data Ecosystem - Matei Zaharia video:
Spark SQL is a distributed SQL engine designed to leverage the power of Spark’s
computation model (based on RDD).
Spark SQL comes with a uniform interface for data access in distributed storage systems
like Cassandra or HDFS (Hive, Parquet, JSON) using specialized DataFrameReader and
DataFrameWriter objects.
Spark SQL allows you to execute SQL-like queries on large volume of data that can live in
Hadoop HDFS or Hadoop-compatible file systems like S3. It can access data from different
data sources - files or tables.
1. Dataset with a strongly-typed LINQ-like Query DSL that Scala programmers will likely
find very appealing to use.
2. Non-programmers will likely use SQL as their query language through direct integration
with Hive
3. JDBC/ODBC fans can use JDBC interface (through Thrift JDBC/ODBC server) and
connect their tools to Spark’s distributed query engine.
Built-in functions or User-Defined Functions (UDFs) that take values from a single row
as input to generate a single return value for every input row.
Aggregate functions that operate on a group of rows and calculate a single return value
per group.
DataFrame
Spark SQL introduces a tabular data abstraction called DataFrame. It is designed to ease
processing large amounts of structured tabular data on Spark infrastructure.
Found the following note about Apache Drill, but it appears to apply to Spark SQL perfectly:
Note A SQL query engine for relational and NoSQL databases with direct queries on self-describing and semi-structured data in files, e.g. JSON or Parquet, and HBase tables without needing to specify metadata definitions in a centralized store.
From user@spark:
If you already loaded csv data into a dataframe, why not register it as a table, and use
Spark SQL to find max/min or any other aggregates? SELECT MAX(column_name)
FROM dftable_name … seems natural.
you’re more comfortable with SQL, it might worth registering this DataFrame as a table
and generating SQL query to it (generate a string with a series of min-max calls)
You can parse data from external data sources and let the schema inferencer deduce the
schema.
Creating DataFrames
From http://stackoverflow.com/a/32514683/1305344:
val df = sc.parallelize(Seq(
Tuple1("08/11/2015"), Tuple1("09/11/2015"), Tuple1("09/12/2015")
)).toDF("date_string")
df.registerTempTable("df")
sqlContext.sql(
"""SELECT date_string,
from_unixtime(unix_timestamp(date_string,'MM/dd/yyyy'), 'EEEEE') AS dow
FROM df"""
).show
The result:
+-----------+--------+
|date_string| dow|
+-----------+--------+
| 08/11/2015| Tuesday|
| 09/11/2015| Friday|
| 09/12/2015|Saturday|
+-----------+--------+
And then…
import org.apache.spark.sql.SaveMode
val df = sqlContext.read.format("com.databricks.spark.avro").load("test.avro")
df.show
val df = sc.parallelize(Seq(
(1441637160, 10.0),
(1441637170, 20.0),
(1441637180, 30.0),
(1441637210, 40.0),
(1441637220, 10.0),
(1441637230, 0.0))).toDF("timestamp", "value")
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// one possible definition of tsGroup: bucket timestamps into 60-second groups
val tsGroup = (floor($"timestamp" / 60) * 60).cast(LongType).as("timestamp")

df.groupBy(tsGroup).agg(mean($"value").alias("value")).show
+----------+-----+
| timestamp|value|
+----------+-----+
|1441637160| 25.0|
|1441637220| 5.0|
+----------+-----+
More examples
Another example:
SQLContext
SQLContext is the entry point for Spark SQL. Whatever you do in Spark SQL, it has to start
with creating an instance of SQLContext.
Creating Datasets
Creating DataFrames
Accessing DataFrameReader
Accessing ContinuousQueryManager
SQLContext(sc: SparkContext)
SQLContext.getOrCreate(sc: SparkContext)
You can get the current value of a configuration property by key using:
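For example (the property names below are just illustrations):

// returns the current value of the property, e.g. the default data source format
sqlContext.getConf("spark.sql.sources.default")

// a variant that falls back to the given default when the property has not been set
sqlContext.getConf("spark.sql.shuffle.partitions", "200")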
Note Properties that start with spark.sql are reserved for Spark SQL.
Creating DataFrames
emptyDataFrame
emptyDataFrame: DataFrame
This variant of createDataFrame creates a DataFrame from RDD of Row and explicit
schema.
udf: UDFRegistration
Functions registered using udf are available for Hive queries only.
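A small registration sketch - the strLen name and the texts table are assumptions made for illustration:

// register a function under the name strLen
sqlContext.udf.register("strLen", (s: String) => s.length)

// the function can then be used in queries
sqlContext.sql("SELECT text, strLen(text) AS len FROM texts").show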
// Create a DataFrame
val df = Seq("hello", "world!").zip(0 to 1).toDF("text", "id")
isCached(tableName: String) checks whether a table is cached or not. It simply requests CacheManager for CachedData and, when it exists, assumes the table
is cached.
uncacheTable(tableName: String)
clearCache(): Unit
Implicits - SQLContext.implicits
The implicits object is a helper class with methods to convert objects into Datasets and
DataFrames, and also comes with many Encoders for "primitive" types as well as the
collections thereof.
It holds Encoders for Scala "primitive" types like Int , Double , String , and their
collections.
It offers support for creating Dataset from RDD of any types (for which an Encoder exists in
scope), or case classes or tuples, and Seq .
It also offers conversions from RDD or Seq of Product types (e.g. case classes or tuples)
to DataFrame . It has direct conversions from RDD of Int , Long and String to
DataFrame with a single column name _1 .
Creating Datasets
Accessing DataFrameReader
read: DataFrameReader
The experimental read method returns a DataFrameReader that is used to read data from
external storage systems and load it into a DataFrame .
It assumes parquet as the default data source format that you can change using
spark.sql.sources.default setting.
The range family of methods creates a Dataset[Long] with the sole id column of
LongType for given start , end , and step .
scala> sqlContext.range(5)
res0: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> sqlContext.range(5).show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
tables(): DataFrame
tables(databaseName: String): DataFrame
tables methods return a DataFrame that holds names of existing tables in a database.
scala> sqlContext.tables.show
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| t| true|
| t2| true|
+---------+-----------+
tableNames(): Array[String]
tableNames(databaseName: String): Array[String]
tableNames are similar to tables with the only difference that they return Array[String]
Accessing ContinuousQueryManager
streams: ContinuousQueryManager
Caution FIXME
SQLContext.getOrCreate method returns an active SQLContext object for the JVM or creates a new one using the given SparkContext.
Interestingly, there are two helper methods to set and clear the active SQLContext object -
setActive and clearActive respectively.
scala> sql("CREATE temporary table t2 USING PARQUET OPTIONS (PATH 'hello') AS SELECT * FROM t")
16/04/14 23:34:38 INFO HiveSqlParser: Parsing command: CREATE temporary table t2 USING PARQUET OPTION
scala> sqlContext.tables.show
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| t| true|
| t2| true|
+---------+-----------+
sql parses sqlText using a dialect that can be set up using spark.sql.dialect setting.
Tip You may also use spark-sql shell script to interact with Hive.
Tip Enable INFO logging level for the loggers that correspond to the implementations of AbstractSqlParser to see what happens inside sql . Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.sql.hive.execution.HiveSqlParser=INFO
Refer to Logging.
newSession(): SQLContext
You can use newSession method to create a new session without the cost of instantiating a
new SQLContext from scratch.
Dataset
Dataset offers convenience of RDDs with the performance optimizations of DataFrames and
the strong static type-safety of Scala. The last feature of bringing the strong type-safety to
DataFrame makes Dataset so appealing. All the features together give you a more
functional programming interface to work with structured data.
Only Datasets offer syntax and analysis checks at compile time (which is not
possible using DataFrame, regular SQL queries or even RDDs).
Using Dataset objects turns DataFrames of Row instances into DataFrames of case
classes with proper names and types (following their equivalents in the case classes).
Instead of using indices to access respective fields in a DataFrame and cast it to a type, all
this is automatically handled by Datasets and checked by the Scala compiler.
You can convert a Dataset to a DataFrame (using toDF() method) or an RDD (using rdd
method).
import sqlContext.implicits._

// a sample Dataset consistent with the output below (the Token name is arbitrary)
case class Token(name: String, productId: Int, score: Double)
val data = Seq(
  Token("aaa", 100, 0.12), Token("aaa", 200, 0.29),
  Token("bbb", 200, 0.53), Token("bbb", 300, 0.42))
val ds = data.toDS
scala> ds.show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa| 100| 0.12|
| aaa| 200| 0.29|
| bbb| 200| 0.53|
| bbb| 300| 0.42|
+----+---------+-----+
scala> ds.printSchema
root
|-- name: string (nullable = true)
|-- productId: integer (nullable = false)
|-- score: double (nullable = false)
scala> ds.map(_.name).show
+-----+
|value|
+-----+
| aaa|
| aaa|
| bbb|
| bbb|
+-----+
Type-safety as Datasets are Scala domain objects and operations operate on their
attributes. All is checked by the Scala compiler at compile time.
toDS(): Dataset[T]
toDF(): DataFrame
Both methods are available through the implicits object of SQLContext.
scala> Seq("hello").toDS
res0: org.apache.spark.sql.Dataset[String] = [value: string]
scala> Seq("hello").toDF
res1: org.apache.spark.sql.DataFrame = [value: string]
scala> Seq("hello").toDF("text")
res2: org.apache.spark.sql.DataFrame = [text: string]
scala> sc.parallelize(Seq("hello")).toDS
res3: org.apache.spark.sql.Dataset[String] = [value: string]
Note
scala> sc.version
res11: String = 2.0.0-SNAPSHOT

scala> :imports
1) import sqlContext.implicits._ (52 terms, 31 are implicit)
2) import sqlContext.sql (1 terms)
Schema
A Dataset has a schema that is available as schema .
You may also use the following methods to learn about the schema:
printSchema(): Unit
explain(): Unit
Caution FIXME
Supported Types
Encoder
Caution FIXME
An Encoder object is used to convert your domain object (a JVM object) into Spark's
internal representation. It allows for significantly faster serialization and deserialization
(compared to the default Java serializer).
Note SQLContext.implicits object comes with Encoders for many types in Scala.
Encoders map columns (of your dataset) to fields (of your JVM object) by name. It is by
Encoders that you can bridge JVM objects to data sources (CSV, JDBC, Parquet, Avro,
JSON, Cassandra, Elasticsearch, memsql) and vice versa.
toJSON
toJSON maps the content of Dataset to a Dataset of JSON strings.
scala> ds.toJSON.show
+-------------------+
| value|
+-------------------+
| {"value":"hello"}|
| {"value":"world"}|
|{"value":"foo bar"}|
+-------------------+
explain
explain(): Unit
explain(extended: Boolean): Unit
explain prints the logical and physical plans to the console. You can use it for debugging.
val ds = sqlContext.range(10)

scala> ds.explain
== Physical Plan ==
WholeStageCodegen
: +- Range 0, 1, 8, 10, [id#9L]
selectExpr
val ds = sqlContext.range(5)
Internally, it executes select with every expression in exprs mapped to Column (using
SparkSqlParser.parseExpression).
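A minimal call on the ds defined above (the expression itself is arbitrary):

// selects the id column and a derived column computed from a SQL expression
ds.selectExpr("id", "id * 2 as doubled").show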
isStreaming
isStreaming returns true when Dataset contains StreamingRelation or StreamingExecutionRelation streaming sources.
randomSplit
You can define a seed and if you don't, a random seed will be used.
val ds = sqlContext.range(10)
scala> ds.randomSplit(Array[Double](2, 3)).foreach(_.show)
+---+
| id|
+---+
| 0|
| 1|
| 2|
+---+
+---+
| id|
+---+
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
QueryExecution
Caution FIXME
val ds = sqlContext.range(5)
scala> ds.queryExecution
res17: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
Range 0, 5, 1, 8, [id#39L]
== Physical Plan ==
WholeStageCodegen
: +- Range 0, 1, 8, 5, [id#39L]
Queryable
Caution FIXME
Columns
Caution FIXME
scala> df.select('id)
res7: org.apache.spark.sql.DataFrame = [id: int]
scala> df.select('id).show
+---+
| id|
+---+
| 0|
| 1|
+---+
over function
over function defines a windowing column that allows for window computations to be applied to a window of rows.
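A hedged sketch of a window computation - the df DataFrame with group and value columns is an assumption:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// rank rows within each group, ordered by value in descending order
val byGroup = Window.partitionBy("group").orderBy(col("value").desc)
val ranked = df.withColumn("rank", rank().over(byGroup))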
cast
cast method casts a column to a type. It makes for type-safe maps with Row objects of the proper type (not Any ).
cast Example
scala> df.printSchema
root
|-- label: float (nullable = false)
|-- text: string (nullable = true)
// without cast
import org.apache.spark.sql.Row
scala> df.select("label").map { case Row(label) => label.getClass.getName }.show(false)
+---------------+
|value |
+---------------+
|java.lang.Float|
+---------------+
// with cast
import org.apache.spark.sql.types.DoubleType
scala> df.select(col("label").cast(DoubleType)).map { case Row(label) => label.getClass.getName }.sho
+----------------+
|value |
+----------------+
|java.lang.Double|
+----------------+
Schema
Caution FIXME Describe me!
scala> df.printSchema
root
|-- label: integer (nullable = false)
|-- sentence: string (nullable = true)
scala> df.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(label,IntegerType,false
scala> df.schema("label").dataType
res1: org.apache.spark.sql.types.DataType = IntegerType
See org.apache.spark.package.scala.
A DataFrame is a distributed collection of tabular data organized into rows and named
columns. It is conceptually equivalent to a table in a relational database and provides
operations to project ( select ), filter , intersect , join , group , sort ,
aggregate , or convert to a RDD (consult DataFrame API).
data.groupBy('Product_ID).sum('Score)
Spark SQL borrowed the concept of DataFrame from pandas' DataFrame and made it
immutable, parallel (one machine, perhaps with many processors and cores) and
distributed (many machines, perhaps with many processors and cores).
Note Hey, big data consultants, time to help teams migrate the code from pandas' DataFrame into Spark's DataFrames (at least to PySpark's DataFrame) and offer services to set up large clusters!
DataFrames in Spark SQL strongly rely on the features of RDD - it’s basically a RDD
exposed as DataFrame by appropriate operations to handle very big data from the day one.
So, petabytes of data should not scare you (unless you’re an administrator to create such
clustered Spark environment - contact me when you feel alone with the task).
scala> df.show
+----+-----+
|word|count|
+----+-----+
| one| 1|
| one| 1|
| two| 1|
+----+-----+
scala> counted.show
+----+-----+
|word|count|
+----+-----+
| two| 1|
| one| 2|
+----+-----+
You can create DataFrames by loading data from structured files (JSON, Parquet, CSV),
RDDs, tables in Hive, or external databases (JDBC). You can also create DataFrames from
scratch and build upon them. See DataFrame API. You can read any format given you have
appropriate Spark SQL extension of DataFrameReader.
the good ol' SQL - helps migrating from SQL databases into the world of DataFrame in
Spark SQL
Query DSL - an API that helps ensuring proper syntax at compile time.
Filtering
DataFrames use the Catalyst query optimizer to produce more efficient queries (and so they
are supposed to be faster than corresponding RDD-based queries).
Note Your DataFrames can also be type-safe and moreover further improve their performance through specialized encoders that significantly cut serialization and deserialization times.
Features of DataFrame
A DataFrame is a collection of "generic" Row instances (as RDD[Row] ) and a schema (as
StructType ).
A schema describes the columns and for each column it defines the name, the type and
whether or not it accepts empty values.
StructType
Caution FIXME
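For illustration, a schema can also be built programmatically from StructField objects (the field names are arbitrary):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("label", IntegerType, nullable = false),
  StructField("sentence", StringType, nullable = true)))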
The withColumn method returns a new DataFrame with the new column col added under the name colName.
scala> df.show
+------+------+
|number|polish|
+------+------+
| 1| jeden|
| 2| dwa|
+------+------+
Caution FIXME
The quickest and easiest way to work with Spark SQL is to use Spark shell and sqlContext
object.
scala> sqlContext
res1: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@60ae950f
The Apache Hive™ data warehouse software facilitates querying and managing large
datasets residing in distributed storage.
Using toDF
After you import sqlContext.implicits._ (which is done for you by Spark shell) you may
apply toDF method to convert objects to DataFrames.
// a sample input consistent with the output below
case class Person(name: String, age: Int)
val df = Seq(Person("Jacek", 42), Person("Patryk", 19), Person("Maksym", 5)).toDF

scala> df.show
+------+---+
| name|age|
+------+---+
| Jacek| 42|
|Patryk| 19|
|Maksym| 5|
+------+---+
SQLContext.emptyDataFrame
Spark SQL 1.3 offers SQLContext.emptyDataFrame operation to create an empty
DataFrame.
scala> adf.registerTempTable("t")
scala> auctions.printSchema
root
|-- auctionid: string (nullable = true)
|-- bid: string (nullable = true)
|-- bidtime: string (nullable = true)
scala> auctions.dtypes
res28: Array[(String, String)] = Array((auctionid,StringType), (bid,StringType), (bidtime,StringType)
scala> auctions.show(5)
+----------+----+-----------+-----------+----------+-------+-----+
| auctionid| bid| bidtime| bidder|bidderrate|openbid|price|
+----------+----+-----------+-----------+----------+-------+-----+
|1638843936| 500|0.478368056| kona-java| 181| 500| 1625|
|1638843936| 800|0.826388889| doc213| 60| 500| 1625|
|1638843936| 600|3.761122685| zmxu| 7| 500| 1625|
|1638843936|1500|5.226377315|carloss8055| 5| 500| 1625|
|1638843936|1600| 6.570625| jdrinaz| 6| 500| 1625|
+----------+----+-----------+-----------+----------+-------+-----+
only showing top 5 rows
scala> lines.count
res3: Long = 1349
scala> case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String, bidderrate:
defined class Auction
scala> df.printSchema
root
|-- auctionid: string (nullable = true)
|-- bid: float (nullable = false)
|-- bidtime: float (nullable = false)
|-- bidder: string (nullable = true)
|-- bidderrate: integer (nullable = false)
|-- openbid: float (nullable = false)
|-- price: float (nullable = false)
scala> df.show
+----------+------+----------+-----------------+----------+-------+------+
| auctionid| bid| bidtime| bidder|bidderrate|openbid| price|
+----------+------+----------+-----------------+----------+-------+------+
|1638843936| 500.0|0.47836804| kona-java| 181| 500.0|1625.0|
|1638843936| 800.0| 0.8263889| doc213| 60| 500.0|1625.0|
|1638843936| 600.0| 3.7611227| zmxu| 7| 500.0|1625.0|
|1638843936|1500.0| 5.2263775| carloss8055| 5| 500.0|1625.0|
|1638843936|1600.0| 6.570625| jdrinaz| 6| 500.0|1625.0|
|1638843936|1550.0| 6.8929167| carloss8055| 5| 500.0|1625.0|
|1638843936|1625.0| 6.8931136| carloss8055| 5| 500.0|1625.0|
|1638844284| 225.0| 1.237419|dre_313@yahoo.com| 0| 200.0| 500.0|
|1638844284| 500.0| 1.2524074| njbirdmom| 33| 200.0| 500.0|
|1638844464| 300.0| 1.8111342| aprefer| 58| 300.0| 740.0|
|1638844464| 305.0| 3.2126737| 19750926o| 3| 300.0| 740.0|
|1638844464| 450.0| 4.1657987| coharley| 30| 300.0| 740.0|
|1638844464| 450.0| 6.7363195| adammurry| 5| 300.0| 740.0|
|1638844464| 500.0| 6.7364697| adammurry| 5| 300.0| 740.0|
|1638844464|505.78| 6.9881945| 19750926o| 3| 300.0| 740.0|
|1638844464| 551.0| 6.9896526| 19750926o| 3| 300.0| 740.0|
|1638844464| 570.0| 6.9931483| 19750926o| 3| 300.0| 740.0|
|1638844464| 601.0| 6.9939003| 19750926o| 3| 300.0| 740.0|
|1638844464| 610.0| 6.994965| 19750926o| 3| 300.0| 740.0|
|1638844464| 560.0| 6.9953704| ps138| 5| 300.0| 740.0|
+----------+------+----------+-----------------+----------+-------+------+
only showing top 20 rows
Note Support for CSV data sources is available by default in Spark 2.0.0. No need for an external module.
scala> df.printSchema
root
|-- auctionid: string (nullable = true)
|-- bid: string (nullable = true)
|-- bidtime: string (nullable = true)
|-- bidder: string (nullable = true)
|-- bidderrate: string (nullable = true)
|-- openbid: string (nullable = true)
|-- price: string (nullable = true)
scala> df.show
+----------+------+-----------+-----------------+----------+-------+-----+
| auctionid| bid| bidtime| bidder|bidderrate|openbid|price|
+----------+------+-----------+-----------------+----------+-------+-----+
|1638843936| 500|0.478368056| kona-java| 181| 500| 1625|
|1638843936| 800|0.826388889| doc213| 60| 500| 1625|
|1638843936| 600|3.761122685| zmxu| 7| 500| 1625|
|1638843936| 1500|5.226377315| carloss8055| 5| 500| 1625|
|1638843936| 1600| 6.570625| jdrinaz| 6| 500| 1625|
|1638843936| 1550|6.892916667| carloss8055| 5| 500| 1625|
|1638843936| 1625|6.893113426| carloss8055| 5| 500| 1625|
|1638844284| 225|1.237418982|dre_313@yahoo.com| 0| 200| 500|
|1638844284| 500|1.252407407| njbirdmom| 33| 200| 500|
|1638844464| 300|1.811134259| aprefer| 58| 300| 740|
|1638844464| 305|3.212673611| 19750926o| 3| 300| 740|
|1638844464| 450|4.165798611| coharley| 30| 300| 740|
|1638844464| 450|6.736319444| adammurry| 5| 300| 740|
|1638844464| 500|6.736469907| adammurry| 5| 300| 740|
|1638844464|505.78|6.988194444| 19750926o| 3| 300| 740|
|1638844464| 551|6.989652778| 19750926o| 3| 300| 740|
|1638844464| 570|6.993148148| 19750926o| 3| 300| 740|
|1638844464| 601|6.993900463| 19750926o| 3| 300| 740|
|1638844464| 610|6.994965278| 19750926o| 3| 300| 740|
|1638844464| 560| 6.99537037| ps138| 5| 300| 740|
+----------+------+-----------+-----------------+----------+-------+-----+
only showing top 20 rows
You can create DataFrames by loading data from structured files (JSON, Parquet, CSV),
RDDs, tables in Hive, or external databases (JDBC) using SQLContext.read method.
read: DataFrameReader
Among the supported structured data (file) formats are (consult Specifying Data Format
(format method) for DataFrameReader ):
JSON
parquet
JDBC
ORC
libsvm
reader.parquet("file.parquet")
reader.json("file.json")
reader.format("libsvm").load("sample_libsvm_data.txt")
Querying DataFrame
Note This variant (in which you use stringified column names) can only select existing columns, i.e. you cannot create new ones using select expressions.
scala> predictions.printSchema
root
|-- id: long (nullable = false)
|-- topic: string (nullable = true)
|-- text: string (nullable = true)
|-- label: double (nullable = true)
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)
|-- features: vector (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
scala> auctions.groupBy("bidder").count().show(5)
+--------------------+-----+
| bidder|count|
+--------------------+-----+
| dennisthemenace1| 1|
| amskymom| 5|
| nguyenat@san.rr.com| 4|
| millyjohn| 1|
|ykelectro@hotmail...| 2|
+--------------------+-----+
only showing top 5 rows
In the following example you query for the top 5 of the most active bidders.
Note the tiny $ and desc together with the column name to sort the rows by.
scala> auctions.groupBy("bidder").count().sort($"count".desc).show(5)
+------------+-----+
| bidder|count|
+------------+-----+
| lass1004| 22|
| pascal1666| 19|
| freembd| 17|
|restdynamics| 17|
| happyrova| 17|
+------------+-----+
only showing top 5 rows
scala> auctions.groupBy("bidder").count().sort(desc("count")).show(5)
+------------+-----+
| bidder|count|
+------------+-----+
| lass1004| 22|
| pascal1666| 19|
| freembd| 17|
|restdynamics| 17|
| happyrova| 17|
+------------+-----+
only showing top 5 rows
scala> df.select("auctionid").distinct.count
res88: Long = 97
scala> df.groupBy("bidder").count.show
+--------------------+-----+
| bidder|count|
+--------------------+-----+
| dennisthemenace1| 1|
| amskymom| 5|
| nguyenat@san.rr.com| 4|
| millyjohn| 1|
|ykelectro@hotmail...| 2|
| shetellia@aol.com| 1|
| rrolex| 1|
| bupper99| 2|
| cheddaboy| 2|
| adcc007| 1|
| varvara_b| 1|
| yokarine| 4|
| steven1328| 1|
| anjara| 2|
| roysco| 1|
|lennonjasonmia@ne...| 2|
|northwestportland...| 4|
| bosspad| 10|
| 31strawberry| 6|
| nana-tyler| 11|
+--------------------+-----+
only showing top 20 rows
Using SQL
Register a DataFrame as a named temporary table to run SQL.
You can execute a SQL query on a DataFrame using sql operation, but before the query is
executed it is optimized by Catalyst query optimizer. You can print the physical plan for a
DataFrame using the explain operation.
scala> sql.explain
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#148L])
TungstenExchange SinglePartition
TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCoun
TungstenProject
Scan PhysicalRDD[auctionid#49,bid#50,bidtime#51,bidder#52,bidderrate#53,openbid#54,price#55]
scala> sql.show
+-----+
|count|
+-----+
| 1348|
+-----+
Filtering
scala> df.show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa| 100| 0.12|
| aaa| 200| 0.29|
| bbb| 200| 0.53|
| bbb| 300| 0.42|
+----+---------+-----+
scala> df.filter($"name".like("a%")).show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa| 100| 0.12|
| aaa| 200| 0.29|
+----+---------+-----+
DataFrame.explain
When performance is the issue you should use DataFrame.explain(true) .
Example Datasets
Row
Row is a data abstraction of an ordered collection of fields that can be accessed by an
ordinal / an index (aka generic access by ordinal), a name (aka native primitive access) or
using Scala’s pattern matching. A Row instance may or may not have a schema.
import org.apache.spark.sql.Row
Field Access
Fields of a Row instance can be accessed by index (starting from 0 ) using apply or
get .
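The row used in the following session could be created like this (the values are illustrative):

val row = Row(1, "hello")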
scala> row(1)
res0: Any = hello
scala> row.get(1)
res1: Any = hello
Note Generic access by ordinal (using apply or get ) returns a value of type Any .
You can query for fields with their proper types using getAs with an index:
scala> row.getAs[Int](0)
res1: Int = 1
scala> row.getAs[String](1)
res2: String = hello
FIXME
Note row.getAs[String](null)
Schema
A Row instance can have a schema defined.
Note Unless you are instantiating Row yourself (using the Row object), a Row always has a schema.
Row Object
Row companion object offers factory methods to create Row instances from a collection of values ( apply ), a sequence of values ( fromSeq ) and a tuple ( fromTuple ).
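A quick sketch of the factory methods (the values are illustrative):

Row(1, "hello")
Row.fromSeq(Seq(1, "hello"))
Row.fromTuple((1, "hello"))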
Logical Plan
Caution FIXME
A LogicalPlan can be analyzed or not, i.e. it may or may not have gone through analysis and verification.
LogicalPlan knows the size of objects that are results of SQL operators, like join, through the
Statistics object.
INNER
LEFT OUTER
RIGHT OUTER
FULL OUTER
LEFT SEMI
NATURAL
DataFrameReader
DataFrameReader is an interface to return a DataFrame from many storage formats in external storage systems (e.g. file systems or databases).
import org.apache.spark.sql.DataFrameReader
val reader: DataFrameReader = sqlContext.read
It has direct support for many file formats and an interface for new ones. It assumes parquet
as the default data source format, which you can change using the spark.sql.sources.default
setting.
json
parquet
orc
text
jdbc
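A minimal sketch of loading a DataFrame with an explicitly specified format (the file path is hypothetical):

val people = sqlContext.read.format("json").load("people.json")
// or using the json shortcut method
val people2 = sqlContext.read.json("people.json")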
You can also use the options method to specify several options in a single Map .
load methods
load(): DataFrame
load(path: String): DataFrame
stream methods
stream(): DataFrame
stream(path: String): DataFrame
JSON
CSV
parquet
ORC
text
json method
csv method
parquet method
New in 2.0.0: snappy is the default Parquet codec. See [SPARK-14482][SQL] Change
default Parquet codec from gzip to snappy.
none or uncompressed
lzo
orc method
Optimized Row Columnar (ORC) file format is a highly efficient columnar format to store
Hive data with more than 1,000 columns and improve performance. ORC format was
introduced in Hive version 0.11 to use and retain the type information from the table
definition.
Tip Read ORC Files document to learn about the ORC file format.
text method
text method loads a text file.
Example
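The lines DataFrame used below could have been created by loading Spark's README.md (the path is an assumption):

val lines = sqlContext.read.text("README.md")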
scala> lines.show
+--------------------+
| value|
+--------------------+
| # Apache Spark|
| |
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
|MLlib for machine...|
|and Spark Streami...|
| |
|<http://spark.apa...|
| |
| |
|## Online Documen...|
| |
|You can find the ...|
|guide, on the [pr...|
|and [project wiki...|
|This README file ...|
| |
| ## Building Spark|
+--------------------+
only showing top 20 rows
table method
scala> sqlContext.read.table("dafa").show(false)
+---+-------+
|id |text |
+---+-------+
|1 |swiecie|
|0 |hello |
+---+-------+
jdbc method
jdbc allows you to create a DataFrame that represents a table in a database available under
url .
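A minimal sketch (the connection URL, table name and credentials are hypothetical):

val props = new java.util.Properties
props.setProperty("user", "dbuser")      // hypothetical credentials
props.setProperty("password", "dbpass")
val people = sqlContext.read.jdbc("jdbc:postgresql://localhost/sparkdb", "people", props)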
DataFrameWriter
DataFrameWriter is used to write DataFrame to external storage systems or data streams.
import org.apache.spark.sql.DataFrameWriter
val df = ...
val writer: DataFrameWriter = df.write
It has direct support for many file formats and an interface for new ones. It assumes parquet
as the default data source format, which you can change using the spark.sql.sources.default
setting.
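A minimal sketch of writing a DataFrame out (the output path is hypothetical):

df.write.format("parquet").mode("overwrite").save("people.parquet")
// or using the parquet shortcut method
df.write.parquet("people.parquet")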
Caution FIXME
Parquet
Caution FIXME
startStream(): ContinuousQuery
startStream(path: String): ContinuousQuery
MemorySink
MemorySink is an internal sink exposed for testing. It is available under the memory format name.
Note It was introduced in the pull request for [SPARK-14288][SQL] Memory Sink for streaming.
Its aim is to allow users to test streaming applications in the Spark shell or other local tests.
It creates MemorySink instance based on the schema of the DataFrame it operates on.
It creates a new DataFrame using MemoryPlan with MemorySink instance created earlier
and registers it as a temporary table (using DataFrame.registerTempTable method).
Note At this point you can query the table as if it were a regular non-streaming table using the sql method.
DataSource
DataSource case class belongs to the Data Source API (with DataFrameReader and
DataFrameWriter).
The functions object offers many functions to work with Column s in DataFrames.
Note The functions object is an experimental feature of Spark since version 1.3.0.
You can access the functions using the following import statement:
import org.apache.spark.sql.functions._
There are about 50 (or more) functions in the functions object. Some functions transform
Column objects (or column names) into other Column objects, others transform a
DataFrame into another DataFrame .
Defining UDFs
String functions
split
Aggregate functions
struct
expr
…and others
Tip You should read the official documentation of the functions object.
The udf family of functions allows you to create user-defined functions (UDFs) based on a
user-defined function in Scala. It accepts a function f of 0 to 10 arguments, with the input
and output types automatically inferred (from the signature of the function f ).
import org.apache.spark.sql.functions._
val _length: String => Int = _.length
val _lengthUDF = udf(_length)
// define a dataframe
val df = sc.parallelize(0 to 3).toDF("num")
udf(f: AnyRef, dataType: DataType) allows you to use a Scala closure for the function
argument (as f ) and explicitly declaring the output data type (as dataType ).
import org.apache.spark.sql.types.IntegerType
val byTwo = udf((n: Int) => n * 2, IntegerType)
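A sketch of applying the two UDFs defined above to DataFrames (the sample data is illustrative):

df.withColumn("doubled", byTwo($"num")).show

val words = Seq("hello", "spark").toDF("word")
words.withColumn("len", _lengthUDF($"word")).show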
String functions
split function
split function splits the str column using the pattern regular expression. It returns a new Column .
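The withSplit DataFrame shown below could be built along these lines (note the pipe has to be escaped as it is a regex meta character):

import org.apache.spark.sql.functions.split
val df = Seq((0, "hello|world"), (1, "witaj|swiecie")).toDF("num", "input")
val withSplit = df.withColumn("split", split($"input", "[|]"))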
scala> withSplit.show
+---+-------------+----------------+
|num| input| split|
+---+-------------+----------------+
| 0| hello|world| [hello, world]|
| 1|witaj|swiecie|[witaj, swiecie]|
+---+-------------+----------------+
Note .$|()[{^?*+\ are RegEx’s meta characters and are considered special.
upper function
upper function converts a string column into one with all letters in upper case. It returns a new
Column .
Note The following example uses two functions that accept a Column and return another to showcase how to chain them.
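A sketch of how withUpperReversed could be built, chaining reverse and upper (the input rows are illustrative):

import org.apache.spark.sql.functions.{reverse, upper}
val df = Seq((0, 1, "hello"), (2, 3, "world"), (2, 4, "ala")).toDF("id", "val", "name")
val withUpperReversed = df.withColumn("upper", upper(reverse($"name")))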
scala> withUpperReversed.show
+---+---+-----+-----+
| id|val| name|upper|
+---+---+-----+-----+
| 0| 1|hello|OLLEH|
| 2| 3|world|DLROW|
| 2| 4| ala| ALA|
+---+---+-----+-----+
Non-aggregate functions
They are also called normal functions.
struct functions
struct family of functions allows you to create a new struct column based on a collection of columns (or column names).
Note The difference between struct and the similar array function is that the types of the columns can be different (in struct ).
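A minimal sketch of struct in action (the input DataFrame is illustrative):

import org.apache.spark.sql.functions.struct
val df = Seq(("aaa", 100, 0.12)).toDF("name", "productId", "score")
df.withColumn("record", struct($"name", $"score")).printSchema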
broadcast function
broadcast function creates a new DataFrame (out of the input DataFrame ) and marks it to be broadcast when used in a join operator.
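A minimal sketch of using it in a join (largeDF and smallDF are hypothetical DataFrames sharing an id column):

import org.apache.spark.sql.functions.broadcast
val joined = largeDF.join(broadcast(smallDF), "id")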
expr function
scala> ds.show
+------+---+
| text| id|
+------+---+
| hello| 0|
|world!| 1|
+------+---+
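expr parses an expression string into the Column it represents, so you can mix SQL expressions with the Dataset API. A sketch using the ds shown above:

import org.apache.spark.sql.functions.expr
ds.filter(expr("id > 0")).show
ds.select(expr("upper(text)").as("upper")).show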
Aggregation (GroupedData)
Note Executing aggregation on DataFrames by means of groupBy is still an experimental feature. It is available since Apache Spark 1.3.0.
You can use DataFrame to compute aggregates over a collection of (grouped) rows.
groupBy
rollup
cube
groupBy Operator
Note The following session uses the data setup as described in the Test Setup section below.
scala> df.show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa| 100| 0.12|
| aaa| 200| 0.29|
| bbb| 200| 0.53|
| bbb| 300| 0.42|
+----+---------+-----+
scala> df.groupBy("name").count.show
+----+-----+
|name|count|
+----+-----+
| aaa| 2|
| bbb| 2|
+----+-----+
scala> df.groupBy("name").max("score").show
+----+----------+
|name|max(score)|
+----+----------+
| aaa| 0.29|
| bbb| 0.53|
+----+----------+
scala> df.groupBy("name").sum("score").show
+----+----------+
|name|sum(score)|
+----+----------+
| aaa| 0.41|
| bbb| 0.95|
+----+----------+
scala> df.groupBy("productId").sum("score").show
+---------+------------------+
|productId| sum(score)|
+---------+------------------+
| 300| 0.42|
| 100| 0.12|
| 200|0.8200000000000001|
+---------+------------------+
GroupedData
GroupedData is the result of executing groupBy on a DataFrame . It offers the following operators:
agg
count
mean
max
avg
min
sum
pivot
Test Setup
This is a setup for learning GroupedData . Paste it into Spark Shell using :paste .
import sqlContext.implicits._
1. Cache the DataFrame so following queries won’t load data over and over again.
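A minimal sketch of such a setup, reconstructed from the rows shown earlier in this section:

val df = Seq(
  ("aaa", 100, 0.12),
  ("aaa", 200, 0.29),
  ("bbb", 200, 0.53),
  ("bbb", 300, 0.42)).toDF("name", "productId", "score").cache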
UDFs — User-Defined Functions
Caution FIXME
Spark SQL supports three kinds of window aggregate function: ranking functions, analytic
functions, and aggregate functions.
Note Window functions are also called over functions due to how they are applied using Column’s over function.
Although similar to aggregate functions, a window function does not group rows into a single
output row and retains their separate identities. A window function can access rows that are
linked to the current row.
ranking functions
analytic functions
aggregate functions
RANK rank
DENSE_RANK dense_rank
NTILE ntile
ROW_NUMBER row_number
CUME_DIST cume_dist
LEAD lead
For aggregate functions, you can use the existing aggregate functions as window functions,
e.g. sum , avg , min , max and count .
You can mark a function as a window function by adding an OVER clause after a function in SQL, e.g. avg(revenue)
OVER (…) , or by using the over method on a function in the Dataset API, e.g. rank().over(…) .
When executed, a window function computes a value for each row in a window.
Note Window functions belong to Window functions group in Spark’s Scala API.
A window specification defines which rows are included in a window (aka a frame), i.e. set
of rows, that is associated with a given input row. It does so by partitioning an entire data
set and specifying frame boundary with ordering.
import org.apache.spark.sql.expressions.Window
1. Partitioning Specification defines which rows will be in the same partition for a given
row.
2. Ordering Specification defines how rows in a partition are ordered, determining the
position of the given row in its partition.
3. Frame Specification (unsupported in Hive; see Why do Window functions fail with
"Window function X does not take a frame specification"?) defines the rows to be
included in the frame for the current input row, based on their relative position to the
current row. For example, “the three rows preceding the current row to the current row”
describes a frame including the current input row and three rows appearing before the
current row.
Once a WindowSpec instance has been created using the Window object, you can further expand
on the window specification using rowsBetween and rangeBetween to define frames.
Besides the two above, you can also use the following methods (that correspond to the
methods in the Window object):
partitionBy
orderBy
Window object
Window object provides functions to define windows (as WindowSpec instances).
import org.apache.spark.sql.expressions.Window
There are two families of the functions available in Window object that create WindowSpec
instance for one or many Column instances:
partitionBy
orderBy
partitionBy
partitionBy creates an instance of WindowSpec with partition expression(s) defined for one
or more columns.
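A sketch that produces the output below: partition the rows by whether the token starts with h and sum the ids within each partition (the sample rows are illustrative):

import org.apache.spark.sql.functions.sum
val tokens = Seq((0, "hello"), (1, "henry"), (2, "and"), (3, "harry")).toDF("id", "token")
val byHTokens = Window.partitionBy($"token".startsWith("h"))
val sumOverH = tokens.select($"*", sum($"id").over(byHTokens).as("sum over h tokens"))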
scala> .show
+---+-----+-----------------+
| id|token|sum over h tokens|
+---+-----+-----------------+
| 0|hello| 4|
| 1|henry| 4|
| 2| and| 2|
| 3|harry| 4|
+---+-----+-----------------+
orderBy
Window Examples
Two samples from org.apache.spark.sql.expressions.Window scaladoc:
// PARTITION BY country ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Window.partitionBy('country).orderBy('date).rowsBetween(Long.MinValue, 0)
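The second sample defines a range-based frame; it looks along these lines:

// PARTITION BY country ORDER BY date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING
Window.partitionBy('country).orderBy('date).rangeBetween(-3, 3)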
Frame
At its core, a window function calculates a return value for every input row of a table based
on a group of rows, called the frame. Every input row can have a unique frame associated
with it.
When you define a frame you have to specify three components of a frame specification -
the start and end boundaries, and the type.
CURRENT ROW
<value> PRECEDING
<value> FOLLOWING
Types of frames:
ROW - based on physical offsets from the position of the current input row
RANGE - based on logical offsets from the position of the current input row
In the current implementation of WindowSpec you can use two methods to define a frame:
rowsBetween
rangeBetween
Examples
Top N per Group is useful when you need to compute the first and second best-sellers in a
category.
Question: What are the best-selling and the second best-selling products in every category?
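The dataset below could be created along these lines (the rows match the output that follows):

val dataset = Seq(
  ("Thin", "cell phone", 6000),
  ("Normal", "tablet", 1500),
  ("Mini", "tablet", 5500),
  ("Ultra thin", "cell phone", 5000),
  ("Very thin", "cell phone", 6000),
  ("Big", "tablet", 2500),
  ("Bendable", "cell phone", 3000),
  ("Foldable", "cell phone", 3000),
  ("Pro", "tablet", 4500),
  ("Pro2", "tablet", 6500)).toDF("product", "category", "revenue")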
scala> dataset.show
+----------+----------+-------+
| product| category|revenue|
+----------+----------+-------+
| Thin|cell phone| 6000|
| Normal| tablet| 1500|
| Mini| tablet| 5500|
|Ultra thin|cell phone| 5000|
| Very thin|cell phone| 6000|
| Big| tablet| 2500|
| Bendable|cell phone| 3000|
| Foldable|cell phone| 3000|
| Pro| tablet| 4500|
| Pro2| tablet| 6500|
+----------+----------+-------+
The question boils down to ranking products in a category based on their revenue, and picking
the best-selling and the second best-selling products based on the ranking.
import org.apache.spark.sql.expressions.Window
val overCategory = Window.partitionBy('category).orderBy('revenue.desc)
val rank = dense_rank.over(overCategory)
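The ranked DataFrame shown below adds the rank column, and filtering on it answers the question (a sketch):

val ranked = dataset.withColumn("rank", rank)
// the best and second best-selling products per category
ranked.where($"rank" <= 2).show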
scala> ranked.show
+----------+----------+-------+----+
| product| category|revenue|rank|
+----------+----------+-------+----+
| Pro2| tablet| 6500| 1|
| Mini| tablet| 5500| 2|
| Pro| tablet| 4500| 3|
| Big| tablet| 2500| 4|
| Normal| tablet| 1500| 5|
| Thin|cell phone| 6000| 1|
| Very thin|cell phone| 6000| 1|
|Ultra thin|cell phone| 5000| 2|
| Bendable|cell phone| 3000| 3|
| Foldable|cell phone| 3000| 3|
+----------+----------+-------+----+
Note This example is the 2nd example from an excellent article Introducing Window Functions in Spark SQL.
import org.apache.spark.sql.expressions.Window
val reveDesc = Window.partitionBy('category).orderBy('revenue.desc)
val reveDiff = max('revenue).over(reveDesc) - 'revenue
Difference on Column
Compute a difference between values in rows in a column.
scala> ds.show
+---+----+
| ns|tens|
+---+----+
| 1| 10|
| 1| 20|
| 2| 20|
| 2| 40|
| 3| 30|
| 3| 60|
| 4| 40|
| 4| 80|
| 5| 50|
| 5| 100|
+---+----+
import org.apache.spark.sql.expressions.Window
val overNs = Window.partitionBy('ns).orderBy('tens)
val diff = lead('tens, 1).over(overNs)
Please see Why do Window functions fail with "Window function X does not take a
frame specification"?
The key here is to remember that DataFrames are RDDs under the covers and hence
aggregation like grouping by a key in DataFrames is RDD’s groupBy (or worse,
reduceByKey or aggregateByKey transformations).
Running Total
The running total is the sum of all previous rows including the current one.
scala> sales.show
+---+-------+------+--------+
| id|orderID|prodID|orderQty|
+---+-------+------+--------+
| 0| 0| 0| 5|
| 1| 0| 1| 3|
| 2| 0| 2| 1|
| 3| 1| 0| 2|
| 4| 2| 0| 8|
| 5| 2| 2| 8|
+---+-------+------+--------+
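The two running totals shown below could be computed along these lines (sales is the DataFrame shown above; its construction is not shown):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
val orderedByID = Window.orderBy($"id")
val salesTotalQty = sales.select($"*", sum($"orderQty").over(orderedByID).as("running_total"))
val byOrderId = Window.partitionBy($"orderID").orderBy($"id")
val salesTotalQtyPerOrder = sales.select($"*", sum($"orderQty").over(byOrderId).as("running_total_per_order"))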
scala> salesTotalQty.show
16/04/10 23:01:52 WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+-------+------+--------+-------------+
| id|orderID|prodID|orderQty|running_total|
+---+-------+------+--------+-------------+
| 0| 0| 0| 5| 5|
| 1| 0| 1| 3| 8|
| 2| 0| 2| 1| 9|
| 3| 1| 0| 2| 11|
| 4| 2| 0| 8| 19|
| 5| 2| 2| 8| 27|
+---+-------+------+--------+-------------+
scala> salesTotalQtyPerOrder.show
+---+-------+------+--------+-----------------------+
| id|orderID|prodID|orderQty|running_total_per_order|
+---+-------+------+--------+-----------------------+
| 0| 0| 0| 5| 5|
| 1| 0| 1| 3| 8|
| 2| 0| 2| 1| 9|
| 3| 1| 0| 2| 2|
| 4| 2| 0| 8| 8|
| 5| 2| 2| 8| 16|
+---+-------+------+--------+-----------------------+
With the Interval data type, you could use intervals as values specified in <value> PRECEDING
and <value> FOLLOWING for RANGE frame. It is specifically suited for time-series analysis with
window functions.
Moving Average
Cumulative Aggregates
E.g. cumulative sum
With the window function support, you could use user-defined aggregate functions as
window functions.
Structured Streaming (Streaming Datasets)
Note It is slated for Spark 2.0 at the end of April or beginning of May, 2016.
Note The feature has also been called Streaming Spark SQL Query, Streaming DataFrames, Continuous DataFrames or Continuous Queries.
ContinuousQueryManager
Note ContinuousQueryManager is an experimental feature of Spark 2.0.0.
You can access ContinuousQueryManager for the current session using SQLContext.streams
method. It is lazily created when a SQLContext instance starts.
startQuery
startQuery(
name: String,
checkpointLocation: String,
df: DataFrame,
sink: Sink,
trigger: Trigger = ProcessingTime(0)): ContinuousQuery
active: Array[ContinuousQuery]
active returns the queries that are currently active in the SQLContext .
ContinuousQueryListener
Caution FIXME
ContinuousQueryListener is an interface for listening to query life cycle events, i.e. when a query is started, makes progress and terminates.
ContinuousQuery
A ContinuousQuery has a name. It belongs to a single SQLContext .
It can be in two states: active (started) or inactive (stopped). If inactive, it may have
transitioned into the state due to a ContinuousQueryException (that is available under
exception ).
There can only be a single Sink for a ContinuousQuery with many Sources.
StreamExecution
StreamExecution manages execution of a streaming query for a SQLContext and a Sink . It
requires a LogicalPlan to know the Source objects from which records are periodically
pulled down.
Note The time between batches - 10 milliseconds - is fixed (i.e. not configurable).
Joins
Caution FIXME
You can use broadcast function to mark a DataFrame to be broadcast when used in a join
operator.
According to the article Map-Side Join in Spark, broadcast join is also called a replicated
join (in the distributed system community) or a map-side join (in the Hadoop community).
Note At long last! I have always been wondering what a map-side join is and it appears I am close to uncovering the truth!
And later in the article Map-Side Join in Spark, you can find that with the broadcast join, you
can very effectively join a large table (fact) with relatively small tables (dimensions), i.e. to
perform a star-schema join you can avoid sending all data of the large table over the
network.
CanBroadcast object matches a LogicalPlan with output small enough for broadcast join.
Note Currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE [tableName] COMPUTE STATISTICS noscan has been run.
Hive Integration
Spark SQL supports Apache Hive using HiveContext . It uses the Spark SQL execution
engine to work with data stored in Hive.
Tip Enable DEBUG logging level for HiveContext to see what happens inside. Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.sql.hive.HiveContext=DEBUG
Refer to Logging.
Hive Functions
SQLContext.sql (or simply sql ) allows you to interact with Hive.
You can use show functions to learn about the Hive functions supported through the Hive
integration.
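A minimal sketch:

sqlContext.sql("show functions").show(false)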
The default configuration uses Hive 1.2.1 with the default warehouse in
/user/hive/warehouse .
HiveSessionState
Caution FIXME
current_database function
current_database function returns the current database of Hive metadata.
Analyzing Tables
analyze(tableName: String)
analyze analyzes tableName table for query optimizations. It currently supports only Hive
tables.
scala> sqlContext.asInstanceOf[HiveContext].analyze("dafa")
16/04/09 14:02:56 INFO HiveSqlParser: Parsing command: dafa
java.lang.UnsupportedOperationException: Analyze only works for Hive tables, but dafa is a LogicalRel
at org.apache.spark.sql.hive.HiveContext.analyze(HiveContext.scala:304)
... 50 elided
Caution FIXME
Settings
spark.sql.hive.metastore.version (default: 1.2.1 ) - the version of the Hive metastore.
Tip Read about Spark SQL CLI in Spark’s official documentation in Running the Spark SQL CLI.
SQL Parsers
AbstractSqlParser abstract class provides the foundation of the SQL parsing infrastructure
in Spark SQL.
SparkSqlParser is…FIXME
Caching
Caution FIXME
You can use CACHE TABLE [tableName] to cache tableName table in memory. It is an eager
operation which is executed as soon as the statement is executed.
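A minimal sketch, assuming a table named dafa is registered:

sqlContext.sql("CACHE TABLE dafa")
// and to remove it from the cache
sqlContext.sql("UNCACHE TABLE dafa")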
Datasets vs RDDs
Many may have been asking themselves why they should use Datasets rather than the
foundation of all Spark, i.e. RDDs with case classes.
With RDDs you have to do an additional hop over a case class and access fields by name.
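A sketch of the two styles side by side (the data is illustrative):

import org.apache.spark.sql.functions.upper
case class Person(id: Long, name: String)
val peopleRDD = sc.parallelize(Seq(Person(0, "Jacek"), Person(1, "Agata")))
// RDD: every field access goes through the case class
peopleRDD.map(_.name.toUpperCase).collect
// DataFrame/Dataset: columns are accessed by name and optimized by Catalyst
peopleRDD.toDF.select(upper($"name")).show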
RowEncoder
RowEncoder is a factory object that maps StructType to ExpressionEncoder[Row] .
Predicate Pushdown
Caution FIXME
When you execute where right after you load a data set into a Dataset, Spark SQL
pushes where predicate down to the source using a corresponding SQL query with WHERE
(or whatever is the proper language for the source).
val df = sqlContext.read
.format("jdbc")
.option("url", "jdbc:...")
.option("dbtable", "people")
.load()
.where($"name" === "Jacek")
Query Plan
Caution FIXME
Attribute
Caution FIXME
Whole stage codegen compiles a subtree of plans that support codegen together into a
single Java function.
val df = sqlContext.range(5)
Project
Filter
It is used by some modern massively parallel processing (MPP) databases to achieve great
performance. See Efficiently Compiling Efficient Query Plans for Modern Hardware (PDF).
CodegenSupport Trait
CodegenSupport is a trait for physical operators with support for codegen. CodegenSupport
extends SparkPlan .
Project Tungsten
code generation
memory management
Tungsten uses sun.misc.unsafe API for direct memory access that aims at bypassing the
JVM in order to avoid garbage collection.
Tungsten does code generation, i.e. generates JVM bytecode on the fly, to access
Tungsten-managed memory structures that gives a very fast access.
Settings
The following is a list of the settings used to configure Spark SQL applications.
sqlContext.setConf("spark.sql.codegen.wholeStage", "false")
spark.sql.allowMultipleContexts - controls whether creating multiple SQLContexts/HiveContexts is allowed.
spark.sql.autoBroadcastJoinThreshold - the maximum size in bytes for a table that will be broadcast to all worker nodes when
performing a join. If the size of the statistics of the logical plan of a DataFrame is at
most the setting, the DataFrame is broadcast for the join.
spark.sql.columnNameOfCorruptRecord
spark.sql.dialect - FIXME
spark.sql.streaming.checkpointLocation - the default location for storing checkpoint data for continuously executing queries. See Data Streams (startStream methods).
spark.sql.codegen.wholeStage - controls whether a whole stage (of multiple operators) will be compiled into a single Java method ( true ) or not ( false ).
Spark MLlib
Caution I’m new to Machine Learning as a discipline and Spark MLlib in particular, so mistakes in this document are considered a norm (not an exception).
You can find the following types of machine learning algorithms in MLlib:
Classification
Regression
Recommendation
Clustering
Statistics
Linear Algebra
Pipelines
Machine Learning uses large datasets to identify (infer) patterns and make decisions (aka
predictions). Automated decision making is what makes Machine Learning so appealing.
You can teach a system from a dataset and let the system act by itself to predict the future.
The amount of data (measured in TB or PB) is what makes Spark MLlib especially important
since a human could not possibly extract much value from the dataset in a short time.
Spark handles data distribution and makes the huge data available by means of RDDs,
DataFrames, and recently Datasets.
Use cases for Machine Learning (and hence Spark MLlib that comes with appropriate
algorithms):
Operational optimizations
Concepts
This section introduces the concepts of Machine Learning and how they are modeled in
Spark MLlib.
Observation
An observation is used to learn about or evaluate (i.e. draw conclusions about) the
observed item’s target value.
Feature
A feature (aka dimension or variable) is an attribute of an observation. It is an independent
variable.
Spark models features as columns in a DataFrame (one per feature or a set of features).
Categorical with discrete values, i.e. the set of possible values is limited, and can range
from one to many thousands. There is no ordering implied, and so the values are
incomparable.
Numerical with quantitative values, i.e. any numerical values that you can compare to
each other. You can further classify them into discrete and continuous features.
Label
A label is a variable that a machine learning system learns to predict. Labels are assigned to
observations.
FP-growth Algorithm
Spark 1.5 has significantly improved frequent pattern mining capabilities with new
algorithms for association rule generation and sequential pattern mining.
Frequent Itemset Mining using the Parallel FP-growth algorithm (since Spark 1.3)
finds popular routing paths that generate most traffic in a particular region
the algorithm looks for common subsets of items that appear across transactions,
e.g. sub-paths of the network that are frequently traversed.
A naive solution: generate all possible itemsets and count their occurrence
the algorithm finds all frequent itemsets without generating and testing all
candidates
a retailer could then use this information to put both toothbrush and floss on sale, but
raise the price of toothpaste to increase overall profit.
FPGrowth model
extract frequent sequential patterns like routing updates, activation failures, and
broadcasting timeouts that could potentially lead to customer complaints and
proactively reach out to customers when it happens.
Power Iteration Clustering (PIC) in MLlib, a simple and scalable graph clustering
method
org.apache.spark.mllib.clustering.PowerIterationClustering
a graph algorithm
takes an undirected graph with similarities defined on edges and outputs clustering
assignment on nodes
The edge properties are cached and remain static during the power iterations.
New MLlib Algorithms in Spark 1.3: FP-Growth and Power Iteration Clustering
You may also think of two additional steps before the final model becomes production ready
and hence of any use:
The goal of the Pipeline API (aka Spark ML or spark.ml given the package the API lives in)
is to let users quickly and easily assemble and configure practical machine learning
pipelines (aka workflows) by standardizing the APIs for different Machine Learning concepts.
The ML Pipeline API is a new DataFrame-based API developed under the spark.ml
package.
Note The old RDD-based API has been developed in parallel under the spark.mllib package. It has been proposed to switch RDD-based MLlib APIs to maintenance mode in Spark 2.0.
Transformers
Models
Estimators
Evaluators
You use a collection of Transformer instances to prepare input DataFrame - the dataset
with proper input data (in columns) for a chosen ML algorithm.
With a Model you can calculate predictions (in prediction column) on features input
column through DataFrame transformation.
Example: In text classification, preprocessing steps like n-gram extraction, and TF-IDF
feature weighting are often necessary before training of a classification model like an SVM.
Upon deploying a model, your system must not only know the SVM weights to apply to input
features, but also transform raw data into the format the model is trained on.
Components of ML Pipeline:
Pipelines become objects that can be saved out and applied in real-time to new
data.
Note A machine learning component is any object that belongs to the Pipeline API, e.g. Pipeline, LinearRegressionModel, etc.
Parameter tuning
Pipelines
A ML pipeline (or a ML workflow) is a sequence of Transformers and Estimators to fit a
PipelineModel to an input dataset.
import org.apache.spark.ml.Pipeline
The Pipeline object can read or load pipelines (refer to Persisting Machine Learning
Components page).
read: MLReader[Pipeline]
load(path: String): Pipeline
You can create a Pipeline with an optional uid identifier. It is of the format
pipeline_[randomUid] when unspecified.
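A sketch of creating the pipeline whose uid is printed below:

import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline()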
scala> println(pipeline.uid)
pipeline_94be47c3b709
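And with an explicit uid:

val pipeline = new Pipeline("my_pipeline")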
scala> println(pipeline.uid)
my_pipeline
The fit method returns a PipelineModel that holds a collection of Transformer objects
that are results of Estimator.fit method for every Estimator in the Pipeline (with possibly-
modified dataset ) or simply input Transformer objects. The input dataset DataFrame is
passed to transform for every Transformer instance in the Pipeline.
It then searches for the index of the last Estimator so that Transformers are computed up to
that index in the pipeline, and the stages after it are simply returned as Transformers. For
each Estimator the fit method is called with the input dataset and the resulting DataFrame
is passed on to the next stage. The transform method is called for every computed Transformer
but the last one (whose output is not needed while fitting).
The method returns a PipelineModel with uid and transformers. The parent Estimator is
the Pipeline itself.
PipelineStage
The PipelineStage abstract class represents a single stage in a Pipeline.
PipelineStage has the following direct implementations (of which few are abstract classes,
too):
Estimators
Models
Pipeline
Predictor
Transformer
(video) Building, Debugging, and Tuning Spark Machine Learning Pipelines - Joseph
Bradley (Databricks)
(video) Spark MLlib: Making Practical Machine Learning Easy and Scalable
Transformers
A transformer is a function that maps (aka transforms) a DataFrame into another
DataFrame (both called datasets).
Transformers prepare a dataset for a machine learning algorithm to work with. They are
also very helpful to transform DataFrames in general, outside the machine learning space.
StopWordsRemover
Binarizer
SQLTransformer
UnaryTransformer
Tokenizer
RegexTokenizer
NGram
HashingTF
OneHotEncoder
Model
StopWordsRemover
StopWordsRemover is a machine learning feature transformer that takes a string array column
and outputs a string array column with all defined stop words removed. The transformer
comes with a standard set of English stop words as default (that are the same as scikit-learn
uses, i.e. from the Glasgow Information Retrieval Group).
import org.apache.spark.ml.feature.StopWordsRemover
val stopWords = new StopWordsRemover
scala> println(stopWords.explainParams)
caseSensitive: whether to do case-sensitive comparison during filtering (default: false)
inputCol: input column name (undefined)
outputCol: output column name (default: stopWords_9c2c0fdd8a68__output)
stopWords: stop words (default: [Ljava.lang.String;@5dabe7c8)
Note null values from the input array are preserved unless adding null to stopWords explicitly.
import org.apache.spark.ml.feature.RegexTokenizer
val regexTok = new RegexTokenizer("regexTok")
.setInputCol("text")
.setPattern("\\W+")
import org.apache.spark.ml.feature.StopWordsRemover
val stopWords = new StopWordsRemover("stopWords")
.setInputCol(regexTok.getOutputCol)
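The df transformed below could be created like this (the texts are illustrative; they match the output that follows):

val df = Seq(
  ("please find it done (and empty)", 0),
  ("About to be rich!", 1),
  ("empty", 2)).toDF("text", "id")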
scala> stopWords.transform(regexTok.transform(df)).show(false)
+-------------------------------+---+------------------------------------+-----------------+
|text |id |regexTok__output |stopWords__output|
+-------------------------------+---+------------------------------------+-----------------+
|please find it done (and empty)|0 |[please, find, it, done, and, empty]|[] |
|About to be rich! |1 |[about, to, be, rich] |[rich] |
|empty |2 |[empty] |[] |
+-------------------------------+---+------------------------------------+-----------------+
Binarizer
Binarizer is a Transformer that splits the values in the input column into two groups - "ones"
for values larger than the threshold and "zeros" for the others.
It works with DataFrames with the input column of DoubleType or VectorUDT. The type of
the result output column matches the type of the input column, i.e. DoubleType or
VectorUDT .
import org.apache.spark.ml.feature.Binarizer
val bin = new Binarizer()
.setInputCol("rating")
.setOutputCol("label")
.setThreshold(3.5)
scala> println(bin.explainParams)
inputCol: input column name (current: rating)
outputCol: output column name (default: binarizer_dd9710e2a831__output, current: label)
threshold: threshold used to binarize continuous features (default: 0.0, current: 3.5)
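The doubles DataFrame used below could be created like this (the ratings are illustrative):

val doubles = Seq((0, 1.0), (1, 1.0), (2, 5.0)).toDF("id", "rating")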
scala> bin.transform(doubles).show
+---+------+-----+
| id|rating|label|
+---+------+-----+
| 0| 1.0| 0.0|
| 1| 1.0| 0.0|
| 2| 5.0| 1.0|
+---+------+-----+
import org.apache.spark.mllib.linalg.Vectors
val denseVec = Vectors.dense(Array(4.0, 0.4, 3.7, 1.5))
val vectors = Seq((0, denseVec)).toDF("id", "rating")
scala> bin.transform(vectors).show
+---+-----------------+-----------------+
| id| rating| label|
+---+-----------------+-----------------+
| 0|[4.0,0.4,3.7,1.5]|[1.0,0.0,1.0,0.0]|
+---+-----------------+-----------------+
SQLTransformer
SQLTransformer is a Transformer that does transformations by executing SELECT … FROM __THIS__
with __THIS__ being the underlying temporary table registered for the input dataset.
Internally, THIS is replaced with a random name for a temporary table (using
registerTempTable).
It requires that the SELECT query uses THIS that corresponds to a temporary table and
simply executes the mandatory statement using sql method.
You have to specify the mandatory statement parameter using setStatement method.
import org.apache.spark.ml.feature.SQLTransformer
val sql = new SQLTransformer()
scala> println(sql.explainParams)
statement: SQL statement (current: SELECT sentence FROM __THIS__ WHERE label = 0)
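A sketch of setting the statement and applying the transformer (the input DataFrame is hypothetical):

sql.setStatement("SELECT sentence FROM __THIS__ WHERE label = 0")
val df = Seq((0, "hello world"), (1, "two words")).toDF("label", "sentence")
sql.transform(df).show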
UnaryTransformers
The UnaryTransformer abstract class is a specialized Transformer that applies
transformation to one input column and writes results to another (by appending a new
column).
Each UnaryTransformer defines the input and output columns using the following "chain"
methods (they return the transformer on which they were executed and so are chainable):
setInputCol(value: String)
setOutputCol(value: String)
When transform is called, it first calls transformSchema (with DEBUG logging enabled) and
then adds the column as a result of calling a protected abstract createTransformFunc .
Internally, transform method uses Spark SQL’s udf to define a function (based on
createTransformFunc function described above) that will create the new output column (with
appropriate outputDataType ). The UDF is later applied to the input column of the input
DataFrame and the result becomes the output column (using DataFrame.withColumn
method).
Tokenizer that converts the input string to lowercase and then splits it by white spaces.
NGram that converts the input array of strings into an array of n-grams.
HashingTF that maps a sequence of terms to their term frequencies (cf. SPARK-13998
HashingTF should extend UnaryTransformer)
OneHotEncoder that maps a numeric input column of label indices onto a column of
binary vectors.
Tokenizer
Tokenizer is a UnaryTransformer that converts the input string to lowercase and then splits
it by white spaces.
import org.apache.spark.ml.feature.Tokenizer
val tok = new Tokenizer()
// dataset to transform
val df = Seq((1, "Hello world!"), (2, "Here is yet another sentence.")).toDF("label", "sentence")
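The tokenized DataFrame shown below is the result of applying the transformer to df (a sketch):

val tokenized = tok.setInputCol("sentence").transform(df)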
scala> tokenized.show(false)
+-----+-----------------------------+-----------------------------------+
|label|sentence |tok_b66af4001c8d__output |
+-----+-----------------------------+-----------------------------------+
|1 |Hello world! |[hello, world!] |
|2 |Here is yet another sentence.|[here, is, yet, another, sentence.]|
+-----+-----------------------------+-----------------------------------+
RegexTokenizer
RegexTokenizer is a UnaryTransformer that tokenizes a String into a collection of String .
import org.apache.spark.ml.feature.RegexTokenizer
val regexTok = new RegexTokenizer()
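A sketch of the setup producing the output below (the sentences are illustrative):

val df = Seq((0, "hello world"), (1, "two  spaces inside")).toDF("label", "sentence")
val tokenized = regexTok.setInputCol("sentence").transform(df)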
scala> tokenized.show(false)
+-----+------------------+-----------------------------+
|label|sentence |regexTok_810b87af9510__output|
+-----+------------------+-----------------------------+
|0 |hello world |[hello, world] |
|1 |two spaces inside|[two, spaces, inside] |
+-----+------------------+-----------------------------+
It supports minTokenLength parameter that is the minimum token length that you can change
using setMinTokenLength method. It simply filters out smaller tokens and defaults to 1 .
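The rt tokenizer and df used below could be set up like this (the rows are illustrative):

val rt = new RegexTokenizer()
val df = Seq((1, "hello world"), (2, "yet another sentence")).toDF("label", "line")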
scala> rt.setInputCol("line").setMinTokenLength(6).transform(df).show
+-----+--------------------+-----------------------------+
|label| line|regexTok_8c74c5e8b83a__output|
+-----+--------------------+-----------------------------+
| 1| hello world| []|
| 2|yet another sentence| [another, sentence]|
+-----+--------------------+-----------------------------+
It has gaps parameter that indicates whether regex splits on gaps ( true ) or matches
tokens ( false ). You can set it using setGaps . It defaults to true .
When set to true (i.e. splits on gaps) it uses Regex.split while Regex.findAllIn for false .
scala> rt.setInputCol("line").setGaps(false).transform(df).show
+-----+--------------------+-----------------------------+
|label| line|regexTok_8c74c5e8b83a__output|
+-----+--------------------+-----------------------------+
| 1| hello world| []|
| 2|yet another sentence| [another, sentence]|
+-----+--------------------+-----------------------------+
scala> rt.setInputCol("line").setGaps(false).setPattern("\\W").transform(df).show(false)
+-----+--------------------+-----------------------------+
|label|line |regexTok_8c74c5e8b83a__output|
+-----+--------------------+-----------------------------+
|1 |hello world |[] |
|2 |yet another sentence|[another, sentence] |
+-----+--------------------+-----------------------------+
It has pattern parameter that is the regex for tokenizing. It uses Scala’s .r method to
convert the string to regex. Use setPattern to set it. It defaults to \\s+ .
It has toLowercase parameter that indicates whether to convert all characters to lowercase
before tokenizing. Use setToLowercase to change it. It defaults to true .
NGram
In this example you use org.apache.spark.ml.feature.NGram that converts the input
collection of strings into a collection of n-grams (of n words).
import org.apache.spark.ml.feature.NGram
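A sketch that produces the output below (the uid bigrams and the tokens are illustrative):

val bigrams = new NGram("bigrams").setInputCol("tokens").setN(2)
val df = Seq((0, Seq("hello", "world"))).toDF("id", "tokens")
bigrams.transform(df).show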
+---+--------------+---------------+
| id| tokens|bigrams__output|
+---+--------------+---------------+
| 0|[hello, world]| [hello world]|
+---+--------------+---------------+
HashingTF
Another example of a transformer is org.apache.spark.ml.feature.HashingTF that works on a
Column of ArrayType .
It transforms the rows for the input column into a sparse term frequency vector.
import org.apache.spark.ml.feature.HashingTF
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("features")
.setNumFeatures(5000)
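The regexedDF used below could be prepared with a RegexTokenizer writing to the words column (a sketch; the texts are illustrative):

import org.apache.spark.ml.feature.RegexTokenizer
val regexTok = new RegexTokenizer().setInputCol("text").setOutputCol("words")
val df = Seq((0, "hello world"), (1, "two  spaces inside")).toDF("id", "text")
val regexedDF = regexTok.transform(df)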
// Use HashingTF
val hashedDF = hashingTF.transform(regexedDF)
scala> hashedDF.show(false)
+---+------------------+---------------------+-----------------------------------+
|id |text |words |features |
+---+------------------+---------------------+-----------------------------------+
|0 |hello world |[hello, world] |(5000,[2322,3802],[1.0,1.0]) |
|1 |two spaces inside|[two, spaces, inside]|(5000,[276,940,2533],[1.0,1.0,1.0])|
+---+------------------+---------------------+-----------------------------------+
The name of the output column is optional, and if not specified, it becomes the identifier of a
HashingTF object with the __output suffix.
scala> hashingTF.uid
res7: String = hashingTF_fe3554836819
scala> hashingTF.transform(regexDF).show(false)
+---+------------------+---------------------+-------------------------------------------+
|id |text |words |hashingTF_fe3554836819__output |
+---+------------------+---------------------+-------------------------------------------+
|0 |hello world |[hello, world] |(262144,[71890,72594],[1.0,1.0]) |
|1 |two spaces inside|[two, spaces, inside]|(262144,[53244,77869,115276],[1.0,1.0,1.0])|
+---+------------------+---------------------+-------------------------------------------+
OneHotEncoder
OneHotEncoder is a Transformer that maps a numeric input column of label indices onto a column of binary vectors.
// dataset to transform
val df = Seq(
(0, "a"), (1, "b"),
(2, "c"), (3, "a"),
(4, "a"), (5, "c"))
.toDF("label", "category")
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer().setInputCol("category").setOutputCol("cat_index").fit(df)
val indexed = indexer.transform(df)
import org.apache.spark.sql.types.NumericType
scala> indexed.schema("cat_index").dataType.isInstanceOf[NumericType]
res0: Boolean = true
import org.apache.spark.ml.feature.OneHotEncoder
val oneHot = new OneHotEncoder()
.setInputCol("cat_index")
.setOutputCol("cat_vec")
scala> oneHotted.show(false)
+-----+--------+---------+-------------+
|label|category|cat_index|cat_vec |
+-----+--------+---------+-------------+
|0 |a |0.0 |(2,[0],[1.0])|
|1 |b |2.0 |(2,[],[]) |
|2 |c |1.0 |(2,[1],[1.0])|
|3 |a |0.0 |(2,[0],[1.0])|
|4 |a |0.0 |(2,[0],[1.0])|
|5 |c |1.0 |(2,[1],[1.0])|
+-----+--------+---------+-------------+
scala> oneHotted.printSchema
root
|-- label: integer (nullable = false)
|-- category: string (nullable = true)
|-- cat_index: double (nullable = true)
|-- cat_vec: vector (nullable = true)
scala> oneHotted.schema("cat_vec").dataType.isInstanceOf[VectorUDT]
res1: Boolean = true
Custom UnaryTransformer
The following class is a custom UnaryTransformer that transforms words using upper letters.
package pl.japila.spark
import org.apache.spark.ml._
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types._
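A minimal sketch of what such a transformer could look like (the class and value names are illustrative):

import org.apache.spark.ml.param.ParamMap

class UpperTransformer(override val uid: String)
    extends UnaryTransformer[String, String, UpperTransformer] {

  def this() = this(Identifiable.randomUID("upper"))

  // only accept string input columns
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType")

  // uppercase the value of the input column
  override protected def createTransformFunc: String => String = _.toUpperCase

  override protected def outputDataType: DataType = StringType

  override def copy(extra: ParamMap): UpperTransformer = defaultCopy(extra)
}

// usage (e.g. in spark-shell after importing the class); the input rows are illustrative
val upper = new UpperTransformer
val df = Seq((0, "hello"), (1, "world")).toDF("id", "text")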
scala> upper.setInputCol("text").transform(df).show
+---+-----+--------------------------+
| id| text|upper_0b559125fd61__output|
+---+-----+--------------------------+
| 0|hello| HELLO|
| 1|world| WORLD|
+---+-----+--------------------------+
Estimators
An estimator is an abstraction of a learning algorithm that fits a model on a dataset.
Note That was so machine learning to explain an estimator this way, wasn’t it? It is just that the more time I spend with the Pipeline API, the more often I use the terms and phrases from this space. Sorry.
Technically, an Estimator produces a Model (i.e. a Transformer) for a given DataFrame and
parameters (as ParamMap ). It fits a model to the input DataFrame and ParamMap to produce
a Transformer (a Model ) that can calculate predictions for any DataFrame -based input
datasets.
It is basically a function that maps a DataFrame onto a Model through fit method, i.e. it
takes a DataFrame and produces a Transformer as a Model .
fit(dataset: DataFrame): M
Some of the direct specialized implementations of the Estimator abstract class are as
follows:
KMeans
TrainValidationSplit
Predictors
KMeans
import org.apache.spark.ml.clustering._
val kmeans = new KMeans()
scala> println(kmeans.explainParams)
featuresCol: features column name (default: features)
initMode: initialization algorithm (default: k-means||)
initSteps: number of steps for k-means|| (default: 5)
k: number of clusters to create (default: 2)
maxIter: maximum number of iterations (>= 0) (default: 20)
predictionCol: prediction column name (default: prediction)
seed: random seed (default: -1689246527)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-4)
The prediction column is of type IntegerType .
Internally, fit method "unwraps" the feature vector in featuresCol column in the input
DataFrame and creates an RDD[Vector] . It then hands the call over to the MLlib variant of KMeans.
Each item (row) in a data set is described by a numeric vector of attributes called features .
A single feature (a dimension of the vector) represents a word (token) with a value that is a
metric that defines the importance of that word or term in the document.
Refer to Logging.
KMeans Example
You can represent a text corpus (document collection) using the vector space model. In this
representation, the vectors have dimension that is the number of different words in the
corpus. It is quite natural to have vectors with a lot of zero values as not all words will be in a
document. We will use an optimized memory representation to avoid zero values using
sparse vectors.
This example shows how to use k-means to classify emails as spam or not.
// NOTE Don't copy and paste the final case class with the other lines
// It won't work with paste mode in spark-shell
final case class Email(id: Int, text: String)
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans
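A sketch of preparing a features column (with 20 features, matching the vectors below) and fitting the model; the emails are illustrative:

import org.apache.spark.ml.feature.{HashingTF, RegexTokenizer}
val tok = new RegexTokenizer().setInputCol("text").setOutputCol("tokens")
val hashTF = new HashingTF().setInputCol("tokens").setOutputCol("features").setNumFeatures(20)
val emails = Seq(Email(0, "hello mom"), Email(1, "cheap deals, buy now")).toDF
val trainDF = hashTF.transform(tok.transform(emails))
val kmModel = kmeans.setFeaturesCol("features").fit(trainDF)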
scala> kmModel.clusterCenters.map(_.toSparse)
res36: Array[org.apache.spark.mllib.linalg.SparseVector] = Array((20,[13],[3.0]), (20,[0,
scala> .show(false)
+---------+------------+---------------------+----------+
|text |tokens |features |prediction|
+---------+------------+---------------------+----------+
|hello mom|[hello, mom]|(20,[2,19],[1.0,1.0])|1 |
+---------+------------+---------------------+----------+
TrainValidationSplit
Caution FIXME
Predictors
A Predictor is a specialization of Estimator for a PredictionModel with its own abstract
train method.
train(dataset: DataFrame): M
The train method is supposed to ease dealing with schema validation and copying
parameters to a trained PredictionModel model. It also sets the parent of the model to itself.
It implements the abstract fit(dataset: DataFrame) of the Estimator abstract class that
validates and transforms the schema of a dataset (using a custom transformSchema of
PipelineStage), and then calls the abstract train method.
LinearRegression
RandomForestRegressor
LinearRegression
LinearRegression is an example of Predictor (indirectly through the specialized Regressor
private abstract class), and hence Estimator, that represents the linear regression algorithm
in Machine Learning.
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression
scala> println(lr.explainParams)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0)
solver: the solver algorithm for optimization. If this is not set or empty, default value is
standardization: whether to standardize the training features before fitting the model (default
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as
LinearRegression.train
train works on a dataset with label and features columns and returns a LinearRegressionModel .
It first counts the number of elements in features column (usually features ). The column
has to be of mllib.linalg.Vector type (and can easily be prepared using HashingTF
transformer).
import org.apache.spark.ml.feature.RegexTokenizer
val regexTok = new RegexTokenizer()
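The spam DataFrame transformed below could be created like this (matching the output that follows):

val spam = Seq(
  (0, "Hi Jacek. Wanna more SPAM? Best!"),
  (1, "This is SPAM. This is SPAM")).toDF("id", "email")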
val spamTokens = regexTok.setInputCol("email").transform(spam)
scala> spamTokens.show(false)
+---+--------------------------------+---------------------------------------+
|id |email |regexTok_646b6bcc4548__output |
+---+--------------------------------+---------------------------------------+
|0 |Hi Jacek. Wanna more SPAM? Best!|[hi, jacek., wanna, more, spam?, best!]|
|1 |This is SPAM. This is SPAM |[this, is, spam., this, is, spam] |
+---+--------------------------------+---------------------------------------+
import org.apache.spark.ml.feature.HashingTF
val hashTF = new HashingTF()
.setInputCol(regexTok.getOutputCol)
.setOutputCol("features")
.setNumFeatures(5000)
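A sketch of labelling the emails and building the training set shown below (the non-spam emails are illustrative):

import org.apache.spark.sql.functions.lit
val spamLabeled = hashTF.transform(spamTokens).withColumn("label", lit(1.0))
val ham = Seq(
  (2, "Hi Jacek. I hope the notes are useful."),
  (3, "Welcome to Apache Spark!")).toDF("id", "email")
val hamLabeled = hashTF.transform(regexTok.transform(ham)).withColumn("label", lit(0.0))
val training = hamLabeled.unionAll(spamLabeled)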
scala> spamLabeled.show
+---+--------------------+-----------------------------+--------------------+-----+
| id| email|regexTok_646b6bcc4548__output| features|label|
+---+--------------------+-----------------------------+--------------------+-----+
| 0|Hi Jacek. Wanna m...| [hi, jacek., wann...|(5000,[2525,2943,...| 1.0|
| 1|This is SPAM. Thi...| [this, is, spam.,...|(5000,[1713,3149,...| 1.0|
+---+--------------------+-----------------------------+--------------------+-----+
scala> training.show
+---+--------------------+-----------------------------+--------------------+-----+
| id| email|regexTok_646b6bcc4548__output| features|label|
+---+--------------------+-----------------------------+--------------------+-----+
| 2|Hi Jacek. I hope ...| [hi, jacek., i, h...|(5000,[72,105,942...| 0.0|
| 3|Welcome to Apache...| [welcome, to, apa...|(5000,[2894,3365,...| 0.0|
| 0|Hi Jacek. Wanna m...| [hi, jacek., wann...|(5000,[2525,2943,...| 1.0|
| 1|This is SPAM. Thi...| [this, is, spam.,...|(5000,[1713,3149,...| 1.0|
+---+--------------------+-----------------------------+--------------------+-----+
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression
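A sketch of fitting the model and preparing a new (hypothetical) email to score:

val lrModel = lr.fit(training)
val email = Seq((4, "Wanna more SPAM? Click here")).toDF("id", "email")
val emailHashed = hashTF.transform(regexTok.transform(email))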
scala> lrModel.transform(emailHashed).select("prediction").show
+-----------------+
| prediction|
+-----------------+
|0.563603440350882|
+-----------------+
RandomForestRegressor
RandomForestRegressor is a concrete Predictor for the Random Forest learning algorithm. It trains a RandomForestRegressionModel .
Caution FIXME
import org.apache.spark.mllib.linalg.Vectors
val features = Vectors.sparse(10, Seq((2, 0.2), (4, 0.4), (6, 0.6)))
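A sketch of building the training data and fitting the model whose trees are inspected below:

import org.apache.spark.ml.regression.RandomForestRegressor
val data = (0 to 4).map(n => (n.toDouble, features)).toDF("label", "features")
val rfr = new RandomForestRegressor
val model = rfr.fit(data)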
scala> data.show(false)
+-----+--------------------------+
|label|features |
+-----+--------------------------+
|0.0 |(10,[2,4,6],[0.2,0.4,0.6])|
|1.0 |(10,[2,4,6],[0.2,0.4,0.6])|
|2.0 |(10,[2,4,6],[0.2,0.4,0.6])|
|3.0 |(10,[2,4,6],[0.2,0.4,0.6])|
|4.0 |(10,[2,4,6],[0.2,0.4,0.6])|
+-----+--------------------------+
scala> model.trees.foreach(println)
DecisionTreeRegressionModel (uid=dtr_247e77e2f8e0) of depth 1 with 3 nodes
DecisionTreeRegressionModel (uid=dtr_61f8eacb2b61) of depth 2 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_63fc5bde051c) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_64d4e42de85f) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_693626422894) of depth 3 with 9 nodes
DecisionTreeRegressionModel (uid=dtr_927f8a0bc35e) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_82da39f6e4e1) of depth 3 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_cb94c2e75bd1) of depth 0 with 1 nodes
DecisionTreeRegressionModel (uid=dtr_29e3362adfb2) of depth 1 with 3 nodes
DecisionTreeRegressionModel (uid=dtr_d6d896abcc75) of depth 3 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_aacb22a9143d) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_18d07dadb5b9) of depth 2 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_f0615c28637c) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_4619362d02fc) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_d39502f828f4) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_896f3a4272ad) of depth 3 with 9 nodes
DecisionTreeRegressionModel (uid=dtr_891323c29838) of depth 3 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_d658fe871e99) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_d91227b13d41) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_4a7976921f4b) of depth 2 with 5 nodes
scala> model.treeWeights
res12: Array[Double] = Array(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
scala> model.featureImportances
res13: org.apache.spark.mllib.linalg.Vector = (1,[0],[1.0])
Example
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val data = (0.0 to 9.0 by 1) // create a collection of Doubles
.map(n => (n, n)) // make it pairs
.map { case (label, feature) =>
LabeledPoint(label, Vectors.dense(feature)) } // create labeled points of dense vectors
.toDF // make it a DataFrame
scala> data.show
+-----+--------+
|label|features|
+-----+--------+
| 0.0| [0.0]|
| 1.0| [1.0]|
| 2.0| [2.0]|
| 3.0| [3.0]|
| 4.0| [4.0]|
| 5.0| [5.0]|
| 6.0| [6.0]|
| 7.0| [7.0]|
| 8.0| [8.0]|
| 9.0| [9.0]|
+-----+--------+
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression
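The model inspected below is simply the result of fitting:

val model = lr.fit(data)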
scala> model.intercept
res1: Double = 0.0
scala> model.coefficients
res2: org.apache.spark.mllib.linalg.Vector = [1.0]
// make predictions
scala> val predictions = model.transform(data)
predictions: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 1 more field]
scala> predictions.show
+-----+--------+----------+
|label|features|prediction|
+-----+--------+----------+
| 0.0| [0.0]| 0.0|
| 1.0| [1.0]| 1.0|
| 2.0| [2.0]| 2.0|
| 3.0| [3.0]| 3.0|
| 4.0| [4.0]| 4.0|
| 5.0| [5.0]| 5.0|
| 6.0| [6.0]| 6.0|
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.mllib.linalg.DenseVector
// NOTE Follow along to learn spark.ml-way (not RDD-way)
predictions.rdd.map { r =>
(r(0).asInstanceOf[Double], r(1).asInstanceOf[DenseVector](0).toDouble, r(2).asInstanceOf[Double]) }
.toDF("label", "feature0", "prediction").show
+-----+--------+----------+
|label|feature0|prediction|
+-----+--------+----------+
| 0.0| 0.0| 0.0|
| 1.0| 1.0| 1.0|
| 2.0| 2.0| 2.0|
| 3.0| 3.0| 3.0|
| 4.0| 4.0| 4.0|
| 5.0| 5.0| 5.0|
| 6.0| 6.0| 6.0|
| 7.0| 7.0| 7.0|
| 8.0| 8.0| 8.0|
| 9.0| 9.0| 9.0|
+-----+--------+----------+
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.DenseVector
case class Prediction(label: Double, feature0: Double, prediction: Double)
object Prediction {
def apply(r: Row) = new Prediction(
label = r(0).asInstanceOf[Double],
feature0 = r(1).asInstanceOf[DenseVector](0).toDouble,
prediction = r(2).asInstanceOf[Double])
}
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.DenseVector
scala> predictions.rdd.map(Prediction.apply).toDF.show
+-----+--------+----------+
|label|feature0|prediction|
+-----+--------+----------+
| 0.0| 0.0| 0.0|
| 1.0| 1.0| 1.0|
| 2.0| 2.0| 2.0|
| 3.0| 3.0| 3.0|
| 4.0| 4.0| 4.0|
| 5.0| 5.0| 5.0|
| 6.0| 6.0| 6.0|
| 7.0| 7.0| 7.0|
| 8.0| 8.0| 8.0|
| 9.0| 9.0| 9.0|
+-----+--------+----------+
Models
Model abstract class is a Transformer with the optional Estimator that has produced it (as its parent ).
Note The Estimator is optional and is only available after fit (of the Estimator that produced the model) has been executed.
There are two direct implementations of the Model class that are not directly related to a
concrete ML algorithm:
PipelineModel
PredictionModel
PipelineModel
Once fit, you can use the result model as any other model to transform datasets (as
DataFrame ).
// Transformer #1
import org.apache.spark.ml.feature.Tokenizer
val tok = new Tokenizer().setInputCol("text")
// Transformer #2
import org.apache.spark.ml.feature.HashingTF
val hashingTF = new HashingTF().setInputCol(tok.getOutputCol).setOutputCol("features")
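A sketch of fitting a Pipeline built from the two transformers and using the resulting PipelineModel (the dataset is hypothetical):

import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tok, hashingTF))
val dataset = Seq((0, "hello world")).toDF("id", "text")
val model = pipeline.fit(dataset)   // a PipelineModel
model.transform(dataset).show(false)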
PredictionModel
PredictionModel is an abstract class to represent a model for prediction algorithms like
regression and classification (that have their own specialized models - details coming up
below).
import org.apache.spark.ml.PredictionModel
The contract of PredictionModel class requires that every custom implementation defines
predict method (with FeaturesType type being the type of features ).
RegressionModel
ClassificationModel
RandomForestRegressionModel
Internally, transform first ensures that the type of the features column matches the type
of the model and adds the prediction column of type Double to the schema of the result
DataFrame .
It then creates the result DataFrame and adds the prediction column with a predictUDF
function applied to the values of the features column.
Caution FIXME A diagram to show the transformation from a dataframe (on the left) and another (on the right) with an arrow to represent the transformation method.
Refer to Logging.
ClassificationModel
ClassificationModel is a PredictionModel that transforms a DataFrame with mandatory
features , label , and rawPrediction (of type Vector) columns to a DataFrame with a prediction column added.
ClassificationModel comes with its own transform (as Transformer) and predict (as
PredictionModel).
The known ClassificationModel implementations include:
DecisionTreeClassificationModel ( final )
LogisticRegressionModel
NaiveBayesModel
RandomForestClassificationModel ( final )
RegressionModel
RegressionModel is a PredictionModel that transforms a DataFrame with mandatory label , features , and prediction columns.
It comes with no methods or values of its own and so is more of a marker abstract class (to combine
different features of regression models under one type).
LinearRegressionModel
LinearRegressionModel represents a model produced by a LinearRegression estimator. It supports the following columns and parameters:
label (required)
features (required)
prediction
regParam
elasticNetParam
maxIter (Int)
tol (Double)
fitIntercept (Boolean)
standardization (Boolean)
weightCol (String)
solver (String)
With DEBUG logging enabled (see above) you can see the following messages in the logs
when transform is called and transforms the schema.
Note: The coefficients Vector and intercept Double are an integral part of LinearRegressionModel as the required input parameters of the constructor.
LinearRegressionModel Example
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression

// ds is a training DataFrame with "label" and "features" columns (not shown here)
val ds: DataFrame = ???

// Importing LinearRegressionModel and being explicit about the type of the model value
// is for learning purposes only
import org.apache.spark.ml.regression.LinearRegressionModel
val model: LinearRegressionModel = lr.fit(ds)
RandomForestRegressionModel
KMeansModel
KMeansModel is a Model of the KMeans algorithm.
// See spark-mllib-estimators.adoc#KMeans
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.DataFrame
val kmeans: KMeans = ???
val trainingDF: DataFrame = ???
val kmModel = kmeans.fit(trainingDF)

// inputDF is a DataFrame with "label" and "features" columns (not shown)
val inputDF: DataFrame = ???
scala> kmModel.transform(inputDF).show(false)
+-----+---------+----------+
|label|features |prediction|
+-----+---------+----------+
|0.0 |[0.2,0.4]|0 |
+-----+---------+----------+
Evaluators
An evaluator is a transformation that maps a DataFrame into a metric that indicates how good a model is.
BinaryClassificationEvaluator
RegressionEvaluator
BinaryClassificationEvaluator
BinaryClassificationEvaluator is a concrete Evaluator for binary classification that expects datasets with the following two columns:
rawPrediction
label
RegressionEvaluator
RegressionEvaluator is a concrete Evaluator for regression that expects datasets with prediction and label columns. The supported metrics include:
rmse (the default; smaller is better), the root mean squared error.
import org.apache.spark.ml.feature.HashingTF
val hashTF = new HashingTF()
.setInputCol(tok.getOutputCol) // it reads the output of tok
.setOutputCol("features")
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tok, hashTF, lr))
// Let's do prediction
// Note that we're using the same dataset as for fitting the model
// Something you'd definitely not be doing in prod
val predictions = model.transform(dataset)
import org.apache.spark.ml.evaluation.RegressionEvaluator
val regEval = new RegressionEvaluator
scala> regEval.evaluate(predictions)
res0: Double = 0.0
CrossValidator
Caution FIXME Needs more love to be finished.
import org.apache.spark.ml.tuning.CrossValidator
Note: What makes CrossValidator a very useful tool for model selection is its ability to work with any Estimator instance, Pipelines included, that can preprocess datasets before passing them on. This gives you a way to work with any dataset and preprocess it before a new (possibly better) model can be fit to it.
import org.apache.spark.ml.tuning.CrossValidator
val cv = new CrossValidator
scala> println(cv.explainParams)
estimator: estimator for selection (undefined)
estimatorParamMaps: param maps for the estimator (undefined)
evaluator: evaluator used to select hyper-parameters that maximize the validated metric (undefined)
numFolds: number of folds for cross validation (>= 2) (default: 3)
seed: random seed (default: -1191137437)
import org.apache.spark.ml.feature.HashingTF
val hashTF = new HashingTF()
.setInputCol(tok.getOutputCol)
.setOutputCol("features")
.setNumFeatures(10)
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans
import org.apache.spark.ml.Pipeline
// tok is the Tokenizer defined earlier in this chapter
val pipeline = new Pipeline().setStages(Array(tok, hashTF, kmeans))
// 0 = scientific text
// 1 = non-scientific text
val trainDS = Seq(
(0L, "[science] hello world", 0),
(1L, "long text", 1),
(2L, "[science] hello all people", 0),
(3L, "[science] hello hello", 0)).toDF("id", "text", "label").cache
import org.apache.spark.ml.tuning.ParamGridBuilder
val paramGrid = new ParamGridBuilder().build()
import org.apache.spark.ml.evaluation.RegressionEvaluator
val regEval = new RegressionEvaluator
import org.apache.spark.ml.tuning.CrossValidator
val cv = new CrossValidator()
.setEstimator(pipeline) // <-- pipeline is the estimator
.setEvaluator(regEval)
.setEstimatorParamMaps(paramGrid)
import org.apache.spark.mllib.linalg.Vectors
val features = Vectors.sparse(3, Array(1), Array(1d))
val df = Seq(
(0, "hello world", 0.0, features),
(1, "just hello", 1.0, features)).toDF("id", "text", "label", "features")
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator
val regEval = new RegressionEvaluator
import org.apache.spark.ml.tuning.ParamGridBuilder
// Parameterize the only estimator used, i.e. LogisticRegression
// Use println(lr.explainParams) to learn about the supported parameters
val paramGrid = new ParamGridBuilder()
.addGrid(lr.regParam, Array(0.1, 0.01))
.build()
import org.apache.spark.ml.tuning.CrossValidator
val cv = new CrossValidator()
.setEstimator(lr) // just LogisticRegression not Pipeline
.setEvaluator(regEval)
.setEstimatorParamMaps(paramGrid)
// FIXME
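What the snippet stops short of is running the cross-validation itself. A minimal sketch of that missing step follows (it assumes a training DataFrame with enough rows for the folds; the tiny df above is only for illustration):
import org.apache.spark.ml.tuning.CrossValidatorModel
val cvModel: CrossValidatorModel = cv.fit(df)  // selects the best LogisticRegressionModel
// The fitted CrossValidatorModel is a Transformer like any other model
cvModel.transform(df).select("label", "prediction").show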
MLWriter
MLWriter abstract class comes with the save(path: String) method to save a machine learning component to a given path.
It comes with another (chainable) method overwrite to overwrite the output path if it
already exists.
overwrite(): this.type
The component is saved into a JSON file (see MLWriter Example section below).
Tip: Enable INFO logging level for the MLWriter implementation logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.ml.Pipeline$.PipelineWriter=INFO
Refer to Logging.
Caution: FIXME The logging doesn’t work and overwriting does not print out INFO messages to the logs :(
MLWriter Example
import org.apache.spark.ml._
val pipeline = new Pipeline().setStages(Array())
pipeline.write.overwrite().save("hello-pipeline")
$ cat hello-pipeline/metadata/part-00000 | jq
{
  "class": "org.apache.spark.ml.Pipeline",
  "timestamp": 1457685293319,
  "sparkVersion": "2.0.0-SNAPSHOT",
  "uid": "pipeline_12424a3716b2",
  "paramMap": {
    "stageUids": []
  }
}
MLReader
MLReader abstract class comes with the load(path: String) method to load a machine learning component from a given path.
import org.apache.spark.ml._
val pipeline = Pipeline.read.load("hello-pipeline")
Example — Text Classification
Note: The example was inspired by the video Building, Debugging, and Tuning Spark Machine Learning Pipelines - Joseph Bradley (Databricks).
Note: The example uses a case class LabeledText to have the schema described nicely.
import sqlContext.implicits._
val data = Seq(LabeledText(0, Scientific, "hello world"), LabeledText(1, NonScientific, "witaj swiecie")).toDF
scala> data.show
+-----+-------------+
|label| text|
+-----+-------------+
| 0| hello world|
| 1|witaj swiecie|
+-----+-------------+
It is then tokenized and transformed into another DataFrame with an additional column
called features that is a Vector of numerical values.
Note Paste the code below into Spark Shell using :paste mode.
import sqlContext.implicits._
Now comes the tokenization part, which maps the input text of each text document into tokens (a Seq[String] ) and then into a Vector of numerical values that can only then be understood by a machine learning algorithm (which operates on Vector instances).
scala> papers.show
+---+------------+----------------+
| id| topic| text|
+---+------------+----------------+
| 0| sci.math| Hello, Math!|
| 1|alt.religion|Hello, Religion!|
| 2| sci.physics| Hello, Physics!|
+---+------------+----------------+
scala> training.show
+---+------------+----------------+-----+
| id| topic| text|label|
+---+------------+----------------+-----+
| 0| sci.math| Hello, Math!| 1.0|
| 1|alt.religion|Hello, Religion!| 0.0|
| 2| sci.physics| Hello, Physics!| 1.0|
+---+------------+----------------+-----+
scala> training.groupBy("label").count.show
+-----+-----+
|label|count|
+-----+-----+
| 0.0| 1|
| 1.0| 2|
+-----+-----+
The train-a-model phase uses the logistic regression machine learning algorithm to build a model and predict the label of future input text documents (and hence classify them as scientific or non-scientific).
import org.apache.spark.ml.feature.RegexTokenizer
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words")
import org.apache.spark.ml.feature.HashingTF
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features").setNumFeatures(5000) // 5000 matches the feature vectors shown below
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
It uses two columns, namely label and the features vector, to build a logistic regression model to make predictions.
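The fit-and-predict step itself is sketched below (the variable names model and predictions match those used in the output that follows):
val model = pipeline.fit(training)          // train a PipelineModel on the training DataFrame
val predictions = model.transform(training) // adds words, features, rawPrediction, probability and prediction columns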
scala> predictions.show
+---+------------+----------------+-----+-------------------+--------------------+-------------------
| id| topic| text|label| words| features| rawPredictio
+---+------------+----------------+-----+-------------------+--------------------+-------------------
| 0| sci.math| Hello, Math!| 1.0| [hello, math!]| (5000,[563],[1.0])|[-3.5586272181164
| 1|alt.religion|Hello, Religion!| 0.0| [hello, religion!]| (5000,[4298],[1.0])|[3.18473454618966
| 2| sci.physics| Hello, Physics!| 1.0|[hello, phy, ic, !]|(5000,[33,2499,33...|[-4.4061570147914
+---+------------+----------------+-----+-------------------+--------------------+-------------------
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC")
scala> evaluator.evaluate(predictions)
res42: Double = 1.0
import org.apache.spark.ml.tuning.ParamGridBuilder
val paramGrid = new ParamGridBuilder()
.addGrid(hashingTF.numFeatures, Array(1000, 10000))
.addGrid(lr.regParam, Array(0.05, 0.2))
.build
import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.param._
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(2)
Caution: FIXME Review https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/tuning
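A sketch of the missing cross-validation and test steps (the test DataFrame below is an assumption made up for illustration):
val cvModel = cv.fit(training)  // returns a CrossValidatorModel wrapping the best model found

val test = Seq(
  (10L, "sci.space", "Hello, Space!")).toDF("id", "topic", "text")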
You can eventually save the model for later use (using DataFrame.write ).
cvModel.transform(test).select("id", "prediction")
.write
.json("/demo/predictions")
Example — Linear Regression
The DataFrame used for Linear Regression has to have features column of
org.apache.spark.mllib.linalg.VectorUDT type.
Note You can change the name of the column using featuresCol parameter.
scala> println(lr.explainParams)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0)
solver: the solver algorithm for optimization. If this is not set or empty, default value is
standardization: whether to standardize the training features before fitting the model (default
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline("my_pipeline")
import org.apache.spark.ml.regression._
val lr = new LinearRegression
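A minimal end-to-end sketch that completes the example (the tiny training dataset is an assumption made up for illustration):
import org.apache.spark.mllib.linalg.Vectors
val ds = Seq(
  (0.0, Vectors.dense(0.0)),
  (1.0, Vectors.dense(1.0)),
  (2.0, Vectors.dense(2.0))).toDF("label", "features")

val model = pipeline.setStages(Array(lr)).fit(ds)
model.transform(ds).select("label", "prediction").show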
Topic modeling is a type of modeling that can be very useful in identifying hidden thematic
structure in documents. Broadly speaking, it aims to find structure within an unstructured
collection of documents. Once the structure is "discovered", you may answer questions like:
Spark MLlib offers out-of-the-box support for Latent Dirichlet Allocation (LDA) which is the
first MLlib algorithm built upon GraphX.
Vector
Vector sealed trait represents a numeric vector of values (of Double type) and their indices (of Int type). It belongs to the org.apache.spark.mllib.linalg package.
Note: It is not the Vector type in Scala or Java. Train your eyes to see two types of the same name. You’ve been warned.
There are exactly two available implementations of Vector sealed trait (that also belong to
org.apache.spark.mllib.linalg package):
DenseVector
SparseVector
import org.apache.spark.mllib.linalg.Vectors
// You can create dense vectors explicitly by giving values per index
val denseVec = Vectors.dense(Array(0.0, 0.4, 0.3, 1.5))
val almostAllZeros = Vectors.dense(Array(0.0, 0.4, 0.3, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))
// You can however create a sparse vector by the size and non-zero elements
val sparse = Vectors.sparse(10, Seq((1, 0.4), (2, 0.3), (3, 1.5)))
import org.apache.spark.mllib.linalg._

// sv below is assumed to be a sparse vector of five 1.0 values, e.g.
val sv = Vectors.sparse(5, Array(0, 1, 2, 3, 4), Array(1.0, 1.0, 1.0, 1.0, 1.0))
scala> sv.size
res0: Int = 5
scala> sv.toArray
res1: Array[Double] = Array(1.0, 1.0, 1.0, 1.0, 1.0)
scala> sv == sv.copy
res2: Boolean = true
scala> sv.toJson
res3: String = {"type":0,"size":5,"indices":[0,1,2,3,4],"values":[1.0,1.0,1.0,1.0,1.0]}
LabeledPoint
Caution FIXME
LabeledPoint is a convenient class for declaring a schema for DataFrames that are used as input for machine learning algorithms expecting a label and a features vector.
Streaming MLlib
The following Machine Learning algorithms have their streaming variants in MLlib:
k-means
Linear Regression
Logistic Regression
Note The streaming algorithms belong to spark.mllib (the older RDD-based API).
Streaming k-means
org.apache.spark.mllib.clustering.StreamingKMeans
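A minimal sketch of using StreamingKMeans (the StreamingContext ssc and the vector-formatted input directory are assumptions):
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors

// training data arrives as text lines parseable by Vectors.parse, e.g. "[0.0,1.1]"
val trainingStream = ssc.textFileStream("/tmp/training").map(Vectors.parse)

val model = new StreamingKMeans()
  .setK(2)
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0)   // dimension 2, initial weight 0.0

model.trainOn(trainingStream)  // the model is updated as new batches arrive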
Sources
Streaming Machine Learning in Spark - Jeremy Freeman (HHMI Janelia Research
Center)
GraphX models graphs as property graphs where vertices and edges can have properties.
Graph
Graph abstract class represents a collection of vertices and edges .
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertices: RDD[(VertexId, String)] =
sc.parallelize(Seq(
(0L, "Jacek"),
(1L, "Agata"),
(2L, "Julian")))
Transformations
mapVertices
mapEdges
mapTriplets
reverse
subgraph
mask
groupEdges
Joins
outerJoinVertices
Computation
aggregateMessages
fromEdgeTuples
fromEdges
apply
GraphImpl
GraphImpl is the default implementation of Graph abstract class.
Such a situation, in which we need to find the best matching in a weighted bipartite
graph, poses what is known as the stable marriage problem. It is a classical problem
that has a well-known solution, the Gale–Shapley algorithm.
Graph Algorithms
GraphX comes with a set of built-in graph algorithms.
PageRank
Triangle Count
Connected Components
Identifies independent disconnected subgraphs.
Collaborative Filtering
What kinds of people like what kinds of products.
Logging
Spark uses log4j for logging.
The valid log levels are "ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE",
"WARN".
conf/log4j.properties
You can set up the default logging for Spark shell in conf/log4j.properties . Use
conf/log4j.properties.template as a starting point.
Spark applications
In standalone Spark applications or while in Spark Shell session, use the following:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
sbt
When running a Spark application from within sbt using run task, you can use the following
build.sbt to configure logging levels:
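(The snippet below is a sketch based on standard sbt settings for forked runs; treat the exact options as assumptions.)
fork in run := true

javaOptions in run ++= Seq(
  "-Dlog4j.debug=true",                      // make log4j print where it finds its configuration
  "-Dlog4j.configuration=log4j.properties")  // picked up from the classpath, e.g. src/main/resources

outputStrategy := Some(StdoutOutput)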
With the above configuration log4j.properties file should be on CLASSPATH which can be
in src/main/resources directory (that is included in CLASSPATH by default).
When run starts, you should see the following output in sbt:
[spark-activator]> run
[info] Running StreamingApp
log4j: Trying to find [log4j.properties] using context classloader sun.misc.Launcher$AppClassLoader@1
log4j: Using URL [file:/Users/jacek/dev/oss/spark-activator/target/scala-2.11/classes/log4j.propertie
log4j: Reading configuration from URL file:/Users/jacek/dev/oss/spark-activator/target/scala-2.11/cla
Performance Tuning
Goal: Improve Spark’s performance where feasible.
For a TPC-DS workload of two sizes (a 20-machine cluster with 850GB of data, and a 60-machine cluster with 2.5TB of data), network optimization could only reduce job completion time by, at most, 2%.
From Making Sense of Spark Performance - Kay Ousterhout (UC Berkeley) at Spark Summit 2015:
reduceByKey is better
it impacts CPU (time to serialize) and the network (time to send the data over the wire)
Metrics System
Spark uses Metrics - a Java library to measure the behavior of the components in Spark.
Caution: FIXME Review
How to use the metrics to monitor Spark using jconsole?
ApplicationSource
WorkerSource
ExecutorSource
JvmSource
MesosClusterSchedulerSource
StreamingSource
Review MetricsServlet
Default properties
"*.sink.servlet.class", "org.apache.spark.metrics.sink.MetricsServlet"
"*.sink.servlet.path", "/metrics/json"
"master.sink.servlet.path", "/metrics/master/json"
"applications.sink.servlet.path", "/metrics/applications/json"
Executors
A non-local executor registers executor source.
Master
$ http http://192.168.1.4:8080/metrics/master/json/path
HTTP/1.1 200 OK
Cache-Control: no-cache, no-store, must-revalidate
Content-Length: 207
Content-Type: text/json;charset=UTF-8
Server: Jetty(8.y.z-SNAPSHOT)
X-Frame-Options: SAMEORIGIN
{
"counters": {},
"gauges": {
"master.aliveWorkers": {
"value": 0
},
"master.apps": {
"value": 0
},
"master.waitingApps": {
"value": 0
},
"master.workers": {
"value": 0
}
},
"histograms": {},
"meters": {},
"timers": {},
"version": "3.0.0"
}
Scheduler Listeners
A Spark listener is a class that listens to execution events from DAGScheduler. It extends
org.apache.spark.scheduler.SparkListener.
when the driver registers a new executor (FIXME or is this to let the driver know about
the new executor?).
Listener Bus
Listener Events
Caution: FIXME What are SparkListenerEvents? Where and why are they posted? What do they cause?
SparkListenerEnvironmentUpdate
SparkListenerApplicationStart
FIXME
SparkListenerExecutorAdded
SparkListenerExecutorAdded is posted as a result of:
Calling MesosSchedulerBackend.resourceOffers.
The event is delivered to listeners through the SparkListener.onExecutorAdded(executorAdded) method.
Spark Listeners
Caution FIXME What do the listeners do? Move them to appropriate sections.
EventLoggingListener
web UI.
ExecutorAllocationListener
HeartbeatReceiver
setting.
It is assumed that the listener comes with one of the following constructors (in this order):
a single-argument constructor that accepts SparkConf
a zero-argument constructor
Internal Listeners
web UI and event logging listeners
Event Logging
Spark comes with its own EventLoggingListener Spark listener that logs events to persistent
storage.
When the listener starts, it prints out the INFO message Logging events to [logPath] and
logs JSON-encoded events using JsonProtocol class.
spark.eventLog.enabled (default: false ) - whether to log Spark events that encode the information displayed in the web UI.
spark.eventLog.dir (default: /tmp/spark-events ) - the directory in which events are logged, e.g. hdfs://namenode:8021/directory . The directory must exist before Spark starts up.
spark.eventLog.compress (default: false ) - whether to compress the logged events (output streams).
$ ./bin/spark-shell --conf \
spark.extraListeners=org.apache.spark.scheduler.StatsReportListener
...
INFO SparkContext: Registered listener org.apache.spark.scheduler.StatsReportListener
...
Settings
spark.extraListeners (default: empty) is a comma-separated list of listener class
names that should be registered with Spark’s listener bus when SparkContext is
initialized.
Exercise
Building Spark
You can download pre-packaged versions of Apache Spark from the project’s web site. The packages are built for different Hadoop versions, but only for Scala 2.10.
Note: Since [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version, the default version of Scala is 2.11.
If you want a Scala 2.11 version of Apache Spark "users should download the Spark source
package and build with Scala 2.11 support" (quoted from the Note at Download Spark).
The build process for Scala 2.11 takes around 15 mins (on a decent machine) and is so simple that it is hard to resist the urge to do it yourself.
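The exact build command is not shown here; a typical invocation with the bundled Maven (an assumption) is:
$ ./build/mvn -DskipTests clean install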
Please note the messages that say the version of Spark (Building Spark Project Parent POM
2.0.0-SNAPSHOT), Scala version (maven-clean-plugin:2.6.1:clean (default-clean) @ spark-
parent_2.11) and the Spark modules built.
The above command gives you the latest version of Apache Spark 2.0.0-SNAPSHOT built
for Scala 2.11.7 (see the configuration of scala-2.11 profile).
Tip You can also know the version of Spark using ./bin/spark-shell --version .
Making Distribution
./make-distribution.sh is the shell script to make a distribution. It uses the same profiles as the Maven build.
Once finished, you will have the distribution in the current directory, i.e. spark-2.0.0-
SNAPSHOT-bin-2.7.1.tgz .
Caution FIXME
Introduction to Hadoop
Note: This page is the place to keep information more general about Hadoop and not related to Spark on YARN or files in Using Input and Output (I/O) (HDFS). I don’t really know what it could be, though. Perhaps nothing at all. Just saying.
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to deliver high-
availability, the library itself is designed to detect and handle failures at the application
layer, so delivering a highly-available service on top of a cluster of computers, each of
which may be prone to failures.
HDFS (Hadoop Distributed File System) is a distributed file system designed to run
on commodity hardware. It is a data storage with files split across a cluster.
Currently, it’s more about the ecosystem of solutions that all use Hadoop infrastructure
for their work.
Yahoo has progressively invested in building and scaling Apache Hadoop clusters with
a current footprint of more than 40,000 servers and 600 petabytes of storage spread
across 19 clusters.
Deep learning can be defined as first-class steps in Apache Oozie workflows with
Hadoop for data processing and Spark pipelines for machine learning.
You can find some preliminary information about Spark pipelines for machine learning in
the chapter ML Pipelines.
HDFS provides fast analytics – scanning over large amounts of data very quickly, but it was
not built to handle updates. If data changed, it would need to be appended in bulk after a
certain volume or time interval, preventing real-time visibility into this data.
HBase complements HDFS’ capabilities by providing fast and random reads and writes
and supporting updating data, i.e. serving small queries extremely quickly, and allowing
data to be updated in place.
From How does partitioning work for data from files on HDFS?:
When Spark reads a file from HDFS, it creates a single partition for a single input split.
Input split is set by the Hadoop InputFormat used to read this file. For instance, if you
use textFile() it would be TextInputFormat in Hadoop, which would return you a
single partition for a single block of HDFS (but the split between partitions would be
done on line split, not the exact block split), unless you have a compressed text file. In
case of compressed file you would get a single partition for a single file (as compressed
text files are not splittable).
If you have a 30GB uncompressed text file stored on HDFS, then with the default HDFS
block size setting (128MB) it would be stored in 235 blocks, which means that the RDD
you read from this file would have 235 partitions. When you call repartition(1000) your
RDD would be marked as to be repartitioned, but in fact it would be shuffled to 1000
partitions only when you will execute an action on top of this RDD (lazy execution
concept)
With HDFS you can store any data (regardless of format and size). It can easily handle
unstructured data like video or other binary files as well as semi- or fully-structured data
like CSV files or databases.
There is the concept of data lake that is a huge data repository to support analytics.
HDFS partitions files into so-called splits and distributes them across multiple nodes in a cluster to achieve fail-over and resiliency.
Further reading
Introducing Kudu: The New Hadoop Storage Engine for Fast Analytics on Fast Data
Tachyon is designed to function as a software file system that is compatible with the
HDFS interface prevalent in the big data analytics space. The point of doing this is that
it can be used as a drop in accelerator rather than having to adapt each framework to
use a distributed caching layer explicitly.
Note I’m going to keep the noise (enterprisey adornments) to the very minimum.
Apache Twill is an abstraction over Apache Hadoop YARN that allows you to use
YARN’s distributed capabilities with a programming model that is similar to running
threads.
In the comments to the article, some people announced their plans to use it with an AWS GPU cluster.
Spark Packages
Spark Packages is a community index of packages for Apache Spark.
Spark Packages is a community site hosting modules that are not part of Apache Spark. It
offers packages for reading different files formats (than those natively supported by Spark)
or from NoSQL databases like Cassandra, code testing, etc.
When you want to include a Spark package in your application, you should use the --packages command-line option.
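For example, the spark-csv package can be pulled in as follows (the coordinates merely illustrate the groupId:artifactId:version format):
$ ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0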
All the Spark shell scripts use the org.apache.spark.launcher.Main class internally, which checks SPARK_PRINT_LAUNCH_COMMAND and, when it is set (to any value), prints out the entire command to the standard error output, i.e. System.err .
$ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell
Spark Command: /Library/Java/JavaVirtualMachines/Current/Contents/Home/bin/java -cp /Users/jacek/dev/
========================================
scala> sc.version
res0: String = 1.6.0-SNAPSHOT
scala> org.apache.spark.SPARK_VERSION
res1: String = 1.6.0-SNAPSHOT
You may see the following WARN messages in the logs when Spark finished the resolving
process:
Open spark-shell and execute :paste -raw that allows you to enter any valid Scala code,
including package .
package org.apache.spark
object spark {
def test = {
import org.apache.spark.scheduler._
println(DAGScheduler.RESUBMIT_TIMEOUT == 200)
}
}
scala> spark.test
true
scala> sc.version
res0: String = 1.6.0-SNAPSHOT
Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
Further reading
Job aborted due to stage failure: Task not serializable
Note: The issue is due to the way Hive works on Windows. You need no changes if you need no Hive integration in Spark SQL.
15/01/29 17:21:27 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
Note: You need to have Administrator rights on your laptop. All the following commands must be executed in a command-line window ( cmd ) run as Administrator, i.e. using the Run As Administrator option while executing cmd .
set HADOOP_HOME=c:\hadoop
set PATH=%HADOOP_HOME%\bin;%PATH%
winutils.exe ls \tmp\hive
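If the listing shows overly restrictive permissions, a commonly used follow-up (an assumption here, as it is not part of this extract) is to relax them:
winutils.exe chmod -R 777 \tmp\hive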
Exercises
Here I’m collecting exercises that aim at strengthening your understanding of Apache Spark.
Exercise
How would you go about solving a requirement to pair elements of the same key and
creating a new RDD out of the matched values?
val users = Seq((1, "user1"), (1, "user2"), (2, "user1"), (2, "user3"), (3,"user2"), (3,"user4"))
// Input RDD
val us = sc.parallelize(users)
// Desired output
Seq("user1","user2"),("user1","user3"),("user1","user4"),("user2","user4"))
scala> r1.partitions.size
res63: Int = 16
When you execute r1.take(1) only one job gets run since it is enough to compute one task
on one partition.
However, when you execute r1.take(2) two jobs get run because the implementation assumes one job with one partition, and if the elements do not add up to the number requested in take , it quadruples the number of partitions to work on in the following jobs.
Can you guess how many jobs are run for r1.take(15) ? How many tasks per job?
Answer: 3.
Start ZooKeeper.
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=<zookeeper_host>:2181
spark.deploy.zookeeper.dir=/spark
$ cp ./sbin/start-master{,-2}.sh
You can check how many instances you’re currently running using jps command as
follows:
$ jps -lm
5024 sun.tools.jps.Jps -lm
4994 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8080 -h localho
4808 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8080 -h localho
4778 org.apache.zookeeper.server.quorum.QuorumPeerMain config/zookeeper.properties
./sbin/start-slave.sh spark://localhost:7077,localhost:17077
Find out which standalone Master is active (there can only be one). Kill it. Observe how the
other standalone Master takes over and lets the Spark shell register with itself. Check out
the master’s UI.
Optionally, kill the worker, make sure it goes away instantly in the active master’s logs.
In the following example you’re going to count the words in README.md file that sits in your
Spark distribution and save the result under README.count directory.
You’re going to use the Spark shell for the example. Execute spark-shell .
wc.saveAsTextFile("README.count") (4)
1. Read the text file - refer to Using Input and Output (I/O).
3. Map each word into a pair and count them by word (key) - see the full sketch below.
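A minimal sketch of the full listing (the whitespace-based splitting is an assumption):
val lines = sc.textFile("README.md")                            // (1)
val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)   // (2)
val wc = words.map(w => (w, 1)).reduceByKey(_ + _)              // (3)
wc.saveAsTextFile("README.count")                               // (4)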
After you have executed the example, see the contents of the README.count directory:
$ ls -lt README.count
total 16
-rw-r--r-- 1 jacek staff 0 9 paź 13:36 _SUCCESS
-rw-r--r-- 1 jacek staff 1963 9 paź 13:36 part-00000
-rw-r--r-- 1 jacek staff 1663 9 paź 13:36 part-00001
The files part-0000x contain the pairs of word and the count.
$ cat README.count/part-00000
(package,1)
(this,1)
(Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1)
(Because,1)
(Python,2)
(cluster.,1)
(its,1)
([run,1)
...
Further (self-)development
Please read the questions and give answers first before looking at the link given.
Overview
You’re going to use sbt as the project build tool. It uses build.sbt for the project’s
description as well as the dependencies, i.e. the version of Apache Spark and others.
With the files in a directory, executing sbt package results in a package that can be
deployed onto a Spark cluster using spark-submit .
build.sbt
scalaVersion := "2.11.7"
SparkMe Application
The application uses a single command-line parameter (as args(0) ) that is the file to
process. The file is read and the number of lines printed out.
package pl.japila.spark

import org.apache.spark.{SparkConf, SparkContext}

object SparkMeApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkMe Application")
    val sc = new SparkContext(conf)

    // the one and only command-line parameter is the file to process
    val fileName = args(0)
    val lines = sc.textFile(fileName).cache()

    val c = lines.count()
    println(s"There are $c lines in $fileName")
  }
}
project/build.properties:
sbt.version=0.13.9
Tip: With the file the build is more predictable as the version of sbt doesn’t depend on the sbt launcher.
Packaging Application
Execute sbt package to package the application.
The application uses only classes that come with Spark, so package is enough.
spark-submit the SparkMe application and specify the file to process (as it is the only and required command-line argument), e.g. build.sbt of the project.
Note: build.sbt is sbt’s build definition and is only used as an input file for demonstration purposes. Any file is going to work fine.
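A sketch of the spark-submit invocation (the jar name follows from the build.sbt sketch above; the master URL is an assumption):
$ ./bin/spark-submit \
  --master spark://localhost:7077 \
  --class pl.japila.spark.SparkMeApp \
  target/scala-2.11/sparkme-app_2.11-1.0.jar build.sbt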
Technology "things":
Spark Streaming on Hadoop YARN cluster processing messages from Apache Kafka
using the new direct API.
Business "things":
Predictive Analytics = Manage risk and capture new business opportunities with real-
time analytics and probabilistic forecasting of customers, products and partners.
data lakes, clickstream analytics, real time analytics, and data warehousing on Hadoop
With transactional tables in Hive together with insert, update, delete, it does the "concatenate" for you automatically at regular intervals. Currently this works only with tables in ORC format (stored as orc).
Hive was originally not designed for updates, because it was purely warehouse-focused; the most recent version can do updates, deletes, etc. in a transactional way.
Criteria:
Spark Streaming jobs are receiving a lot of small events (avg 10kb)
Requirements
1. Typesafe Activator
Add the following line to build.sbt (the main configuration file for the sbt project) that adds the dependency on Spark 1.5.1. Note the double %, which selects the proper version of the dependency for Scala 2.11.7.
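The line in question is likely along these lines (spark-core is an assumption; the exercise may need other Spark modules too):
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"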
$ mkdir -p src/main/scala/pl/japila/spark
package pl.japila.spark
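// The rest of the source file is sketched below: a minimal custom listener
// (the class name CustomSparkListener is an assumption; the "Job started:" message
//  is the one referred to at the end of this exercise).

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

class CustomSparkListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    println(s"Job started: ${jobStart.jobId}")
  }
}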
[custom-spark-listener]> package
[info] Compiling 1 Scala source to /Users/jacek/dev/sandbox/custom-spark-listener/target/scala-2.11/c
[info] Packaging /Users/jacek/dev/sandbox/custom-spark-listener/target/scala-2.11/custom-spark-listen
[info] Done packaging.
[success] Total time: 0 s, completed Nov 4, 2015 8:59:30 AM
You should have the result jar file with the custom scheduler listener ready (mine is
/Users/jacek/dev/sandbox/custom-spark-listener/target/scala-2.11/custom-spark-
listener_2.11-1.0.jar )
The last line that starts with Job started: is from the custom Spark listener you’ve just
created. Congratulations! The exercise’s over.
Questions
1. What are the pros and cons of using the command line version vs inside a Spark
application?
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scal
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scal
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scal
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:635)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:567)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:563)
at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:802)
at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:836)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:694)
at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:404)
at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcZ$sp(SparkILoop.scala:
at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:38)
at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:38)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:213)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:38)
at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:922)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:9
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911)
at org.apache.spark.repl.Main$.main(Main.scala:49)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSub
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Execute the command to have the jar downloaded into ~/.ivy2/jars directory by
spark-shell :
Start ./bin/spark-shell with --driver-class-path command line option and the driver jar.
It will give you the proper setup for accessing PostgreSQL using the JDBC driver.
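A hedged sketch of the two commands and the DataFrame creation (the JDBC driver version, the jar name under ~/.ivy2/jars, and the JDBC URL are assumptions, not taken from this extract):

$ ./bin/spark-shell --packages org.postgresql:postgresql:9.4.1207

$ ./bin/spark-shell --driver-class-path ~/.ivy2/jars/org.postgresql_postgresql-9.4.1207.jar

// inside spark-shell: load the projects table as a DataFrame
val df = sqlContext.read.format("jdbc")
  .options(Map("url" -> "jdbc:postgresql:sparkdb", "dbtable" -> "projects"))
  .load()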
scala> df.show(false)
+---+------------+-----------------------+
|id |name |website |
+---+------------+-----------------------+
|1 |Apache Spark|http://spark.apache.org|
|2 |Apache Hive |http://hive.apache.org |
|3 |Apache Kafka|http://kafka.apache.org|
|4 |Apache Flink|http://flink.apache.org|
+---+------------+-----------------------+
Troubleshooting
If things can go wrong, they sooner or later go wrong. Here is a list of possible issues and
their solutions.
PostgreSQL Setup
Installation
Create Database
Accessing Database
Creating Table
Dropping Database
Installation
Caution: This page serves as a cheatsheet for the author so he does not have to search the Internet to find the installation steps.
Note Consult 17.3. Starting the Database Server in the official documentation.
$ postgres -D /usr/local/var/postgres
Create Database
$ createdb sparkdb
Accessing Database
Use psql sparkdb to access the database.
$ psql sparkdb
psql (9.5.2)
Type "help" for help.
sparkdb=#
Execute SELECT version() to know the version of the database server you have connected
to.
Creating Table
Create a table using CREATE TABLE command.
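A hedged sketch of a table that matches the DataFrame shown earlier (the column types and the INSERT statements are assumptions):
CREATE TABLE projects (
  id SERIAL PRIMARY KEY,
  name VARCHAR(64) NOT NULL,
  website VARCHAR(64)
);

INSERT INTO projects (name, website) VALUES ('Apache Spark', 'http://spark.apache.org');
INSERT INTO projects (name, website) VALUES ('Apache Hive', 'http://hive.apache.org');
INSERT INTO projects (name, website) VALUES ('Apache Kafka', 'http://kafka.apache.org');
INSERT INTO projects (name, website) VALUES ('Apache Flink', 'http://flink.apache.org');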
Execute select * from projects; to ensure that you have the following records in
projects table:
Dropping Database
$ dropdb sparkdb
Spark courses
Spark Fundamentals I from Big Data University.
Books
O’Reilly
Manning
Packt
Spark Cookbook
Apress
Commercial Products
Spark has reached the point where companies around the world adopt it to build their own
solutions on top of it.
In IBM Analytics for Apache Spark you can use IBM SystemML machine learning technology
and run it on IBM Bluemix.
IBM Cloud relies on OpenStack Swift as Data Storage for the service.
Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive
service designed to easily and cost effectively process big datasets. You can quickly
create managed clusters of any size and turn them off when you are finished, so you
only pay for what you need. Cloud Dataproc is integrated across several Google Cloud
Platform products, so you have access to a simple, powerful, and complete data
processing platform.
1. Requirements
2. Day 1
3. Day 2
3. The latest release of Apache Spark pre-built for Hadoop 2.6 and later from Download
Spark.
5. Developing Spark applications using Scala and sbt and deploying to the Spark
Standalone cluster - 2 x 45 mins
Duration: …FIXME
REST Server
Read REST Server.
spark-shell is spark-submit
Read Spark shell.
You may also make it a little bit heavier by explaining data distribution over the cluster and going over the concepts of drivers, masters, workers, and executors.
Glossary
FIXME
A place that begs for more relevant content.