Apache Spark Graph Processing - Sample Chapter
This book will teach you the techniques and recipes for
large-scale graph processing using Apache Spark. It is a
detailed, step-by-step guide that will prove essential to anyone with an
interest in, or a need to process, large graphs.
This book is a step-by-step guide where each chapter
teaches important techniques for different stages of the
pipeline, from loading and transforming graph data to
implementing graph-parallel operations and machine
learning algorithms. It also uses a hands-on approach. We
show how each technique works, using the Scala REPL with
simple examples, and build standalone Spark applications.
The book provides detailed, complete, and tested Scala code.
Apache Spark
Graph Processing
Build, process, and analyze large-scale graphs with Spark
Foreword by Denny Lee, Technology Evangelist, Databricks; Advisor, WearHacks
Rindra Ramamonjison
Rindra Ramamonjison is a PhD student at the University of British Columbia, Vancouver. He received his master's degree from
Tokyo Institute of Technology. He has played various roles in many engineering
companies, within telecom and finance industries. His primary research interests are
machine learning, optimization, graph processing, and statistical signal processing.
Rindra is also the co-organizer of the Vancouver Spark Meetup.
Preface
This book presents the GraphX library for Apache Spark and teaches the
fundamental techniques and recipes for processing graph data at scale. It is
intended as a self-study, step-by-step guide for anyone new to Spark with an
interest in or need for large-scale graph processing.
Distinctive features
The focus of this book is on large-scale graph processing with Apache Spark. The
book teaches a variety of graph processing abstractions and algorithms, and
provides concise but sufficient information about them, so that you can
confidently learn them and put them to use in different applications.
Hands-on approach: We show how each technique works using the Scala
REPL with simple examples and by building standalone Spark applications.
Detailed code: All the Scala code in the book is available for download from
the book webpage of Packt Publishing.
Experiment with the Spark shell and review Spark's data abstractions
Create a graph and explore the links using base RDD and graph operations
Make sure you install the JDK instead of the Java Runtime Environment (JRE). You
can download it from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.
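Once the installation is done, you can verify that you have the JDK, and not just a JRE, with a quick check in a terminal; the version string you see will depend on the update you installed:
java -version
javac -version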
Next, download the latest release of Spark from the project website
https://spark.apache.org/downloads.html. Perform the following
three steps to get Spark installed on your computer:
1. Select the package type: Pre-built for Hadoop 2.6 and later and then Direct
Download. Make sure you choose a prebuilt version for Hadoop instead of
the source code.
2. Download the compressed TAR file called spark-1.4.1-bin-hadoop2.6.tgz
and place it into a directory on your computer.
3. Open a terminal and change to that directory. Using the
following commands, extract the TAR file, rename the Spark root folder
to spark-1.4.1, and then list the installed files and subdirectories:
tar -xf spark-1.4.1-bin-hadoop2.6.tgz
mv spark-1.4.1-bin-hadoop2.6 spark-1.4.1
cd spark-1.4.1
ls
That's it! You now have Spark and its libraries installed on your computer. Note the
following files and directories in the spark-1.4.1 home folder:
core: This directory contains the source code for the core components and
API of Spark
bin: This directory contains the executable scripts that are used to submit
Spark applications or to interact with Spark through the Spark shell
graphx, mllib, sql, and streaming: These are Spark libraries that provide
a unified interface to do different types of data processing, namely graph
processing, machine learning, queries, and stream processing
It is often convenient to create shortcuts to the Spark home folder and Spark example
folders. In Linux or Mac, open or create the ~/.bash_profile file in your home
folder and insert the following lines:
export SPARKHOME="/[Where you put Spark]/spark-1.4.1/"
export SPARKSCALAEX="${SPARKHOME}examples/src/main/scala/org/apache/spark/examples/"
Then, execute the following command for the previous shortcuts to take effect:
source ~/.bash_profile
As a result, you can quickly access these folders in the terminal or Spark shell. For
example, the example named LiveJournalPageRank.scala can be accessed with:
$SPARKSCALAEX/graphx/LiveJournalPageRank.scala
If you set the current directory (cd) to $SPARKHOME, you can simply launch the
shell with:
cd $SPARKHOME
./bin/spark-shell
If you were successful in launching the Spark shell, you should see the welcome
message like this:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java ...)
For a sanity check, you can type in some Scala expressions or declarations and have
them evaluated. Let's type some commands into the shell now:
scala> sc
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@52e52233

scala> val myRDD = sc.parallelize(List(1,2,3,4,5))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> sc.textFile("README.md").filter(line => line contains "Spark").count()
res2: Long = 21
Here is what the preceding code does. First, we displayed the Spark
context defined by the variable sc, which is automatically created when you launch
the Spark shell. The Spark context is the point of entry to the Spark API. Second,
we created an RDD named myRDD by calling the parallelize
function on a list of five numbers. Finally, we loaded the README.md file into an
RDD, filtered the lines that contain the word "Spark", and invoked the count action
on the filtered RDD to count the number of those lines.
It is expected that you are already familiar with the basic RDD
transformations and actions, such as map, reduce, and filter.
If that is not the case, I recommend that you learn them first,
perhaps by reading the programming guide at https://spark.apache.org/docs/latest/programming-guide.html or an
introductory book such as Fast Data Processing with Spark by Packt
Publishing and Learning Spark by O'Reilly Media.
Don't panic if you did not fully grasp the mechanics behind RDDs. The following
refresher covers the important points. The RDD is Spark's core data abstraction:
it represents a distributed collection of data that can be partitioned and
processed in parallel across a cluster of machines. The Spark API provides a
uniform set of operations to transform and reduce the data within an RDD. On top
of these abstractions and operations, the GraphX library offers a flexible API
that enables us to create graphs and operate on them easily.
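As a quick illustration of these basic operations, here is a minimal sketch that can be typed into the Spark shell; the input numbers are arbitrary example data:
scala> val nums = sc.parallelize(List(1, 2, 3, 4, 5))
scala> nums.map(x => x * x).filter(_ % 2 == 1).reduce(_ + _)   // returns 35 (1 + 9 + 25)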
Perhaps, when you ran the preceding commands in the Spark shell, you were
overwhelmed by the long list of logging statements that start with INFO. There is a
way to reduce the amount of information that Spark outputs in the shell.
You can reduce the level of verbosity of the Spark shell as follows. First, copy
the conf/log4j.properties.template file in the Spark home folder to
conf/log4j.properties. Then, in that file, change the line that reads
log4j.rootCategory=INFO, console to either of the following:
log4j.rootCategory=WARN, console
log4j.rootCategory=ERROR, console
Finally, restart the Spark shell and you should now see fewer
logging messages in the shell output.
As said previously, SparkContext is the main point of entry into a Spark program
and it is created automatically in the Spark shell. It also offers useful methods to
create RDDs from local collections, to load data from a local or Hadoop file system
into RDDs, and to save output data on disks.
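For reference, here is a minimal sketch of that loading step; the file names match those used later in this chapter, but the exact paths are an assumption and should point to wherever you saved the CSV files:
scala> val people = sc.textFile("./data/people.csv")
scala> val links = sc.textFile("./data/links.csv")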
Loading the CSV files just gave us back two RDDs of strings. To create our graph,
we need to parse these strings into two suitable collections of vertices and edges.
It is important that your current directory inside the shell
is $SPARKHOME. Otherwise, you get an error later because
Spark cannot find the files.
This means that the Graph class provides getters to access its vertices and
its edges. These are later abstracted by the RDD subclasses VertexRDD[VD] and
EdgeRDD[ED, VD]. Note that VD and ED here denote the Scala type parameters
of the classes VertexRDD, EdgeRDD, and Graph. These type parameters can be
standard types, such as String, or user-defined classes, such as the Person
class in our social graph example. It is important to note that the property
graph in Spark is a directed multigraph. It means that the graph is permitted to have
multiple edges between any pair of vertices. Moreover, each edge is directed and
defines a unidirectional relationship. This is easy to grasp, for instance, in a Twitter
graph where a user can follow another one but the converse does not need to be
true. To model bidirectional links, such as a Facebook friendship, we need to define
two edges between the nodes, and these edges should point in opposite directions.
Additional properties about the relationship can be stored as an attribute of the edge.
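For example, a mutual friendship between two hypothetical vertices with IDs 1 and 2 would be represented by two directed edges carrying the same attribute (a sketch; the attribute value is arbitrary):
import org.apache.spark.graphx.Edge
val friendship = Seq(Edge(1L, 2L, "friend"), Edge(2L, 1L, "friend"))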
A property graph is a graph with user-defined objects attached
to each vertex and edge. The classes of these objects describe the
properties of the graph. This is done in practice by parameterizing
the classes Graph, VertexRDD, and EdgeRDD. Moreover, each edge
of the graph defines a unidirectional relationship but multiple edges
can exist between any pair of vertices.
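The parsing step below assumes vertex and edge attribute types along the following lines, typed into the Spark shell. This is a sketch with assumed field names; the exact definitions are in the downloadable code:
case class Person(name: String, age: Int)
type Connection = String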
2. Next, we are going to parse each line of the CSV texts inside people and links
into new objects of type Person and Edge respectively, and collect the results
in RDD[(VertexId, Person)] and RDD[Edge[String]]:
val peopleRDD: RDD[(VertexId, Person)] = people map { line =>
  val row = line split ','
  (row(0).toInt, Person(row(1), row(2).toInt))
}
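The links RDD can be parsed in the same way. Here is a sketch, assuming each line of links.csv has the form sourceId,destinationId,relationship and using the Connection alias sketched earlier:
val linksRDD: RDD[Edge[Connection]] = links map { line =>
  val row = line split ','
  Edge(row(0).toInt, row(1).toInt, row(2))
}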
3. Now, we can create our social graph and name it tinySocial using the
factory method Graph():
scala> val tinySocial: Graph[Person, Connection] = Graph(peopleRDD, linksRDD)
tinySocial: org.apache.spark.graphx.Graph[Person,Connection] = org.apache.spark.graphx.impl.GraphImpl@128cd92a
There are two things to note about this constructor. I told you earlier that
the member vertices and edges of the graph are instances of VertexRDD[VD]
and EdgeRDD[ED,VD]. However, we passed RDD[(VertexId, Person)] and
RDD[Edge[Connection]] into the above factory method Graph. How did that
work? It worked because VertexRDD[VD] and EdgeRDD[ED,VD] are subclasses of
RDD[(VertexId, VD)] and RDD[Edge[ED]], respectively. In addition,
VertexRDD[VD] adds the constraint that each VertexId occurs only once. Basically,
two people in our social network cannot have the same vertex ID. Furthermore,
VertexRDD[VD] and EdgeRDD[ED,VD] provide several other operations to transform
vertex and edge attributes. We will see more of these in later chapters.
We used the edges and vertices getters of the Graph class and the RDD
action collect to put the results into local collections.
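As a reminder, such a listing can be produced in the shell along these lines (a sketch; the exact output depends on the data loaded earlier):
scala> tinySocial.vertices.collect().foreach(println)
scala> tinySocial.edges.collect().foreach(println)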
Now, suppose we want to print only the professional connections that are listed in
the following profLinks list:
val profLinks: List[Connection] = List("Coworker", "Boss",
"Employee","Client", "Supplier")
A bad way to arrive at the desired result is to filter the edges corresponding
to professional connections, then loop through the filtered edges, extract the
corresponding vertices' names, and print the connections between the source
and destination vertices. We can write that method in the following code:
val profNetwork =
  tinySocial.edges.filter{ case Edge(_,_,link) => profLinks.contains(link) }
for {
  Edge(src, dst, link) <- profNetwork.collect()
  srcName = (peopleRDD.filter{ case (id, person) => id == src } first)._2.name
  dstName = (peopleRDD.filter{ case (id, person) => id == dst } first)._2.name
} println(srcName + " is a " + link + " of " + dstName)
Charlie is a boss of Bob
Dave is a client of Eve
George is a coworker of Ivy
There are two problems with the preceding code. First, it could be more concise
and expressive. Second, it is not efficient due to the filtering operations inside the
for loop.
Luckily, there is a better alternative. The GraphX library provides two different ways
to view data: either as a graph or as tables of edges, vertices, and triplets. For each
view, the library offers a rich set of operations whose implementations are optimized
for execution. That means that we can often process a graph easily, using a
predefined graph operation or algorithm. For instance, we could simplify the previous
code and make it more efficient, as follows:
tinySocial.subgraph(profLinks contains _.attr).triplets.foreach(
  t => println(t.srcAttr.name + " is a " + t.attr + " of " + t.dstAttr.name))
Charlie is a boss of Bob
Dave is a client of Eve
George is a coworker of Ivy
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._

object SimpleGraphApp {
  def main(args: Array[String]){
    // Configure the program
    val conf = new SparkConf()
      .setAppName("Tiny Social")
      .setMaster("local")
      .set("spark.driver.memory", "2G")
    val sc = new SparkContext(conf)

    // Load some data into RDDs
    val people = sc.textFile("./data/people.csv")
    val links = sc.textFile("./data/links.csv")

    // After that, we use the Spark API as in the shell
    // ...
  }
}
You can also see the entire code of our SimpleGraph.scala file in the example files,
which you can download from the Packt website.
Downloading the example code
You can download the example code files from your account at
http://www.packtpub.com for all the Packt Publishing books
you have purchased. If you purchased this book elsewhere, you
can visit http://www.packtpub.com/support and register to
have the files e-mailed directly to you.
Let's go over this code to understand what is required to create and configure a
Spark standalone program in Scala.
As a Scala program, our Spark application should be constructed within a
top-level Scala object, which must have a main function that has the signature:
def main(args: Array[String]): Unit. In other words, the main program
accepts an array of strings as a parameter and returns nothing. In our example,
the top-level object is SimpleGraphApp.
The first two lines import the SparkContext class as well as some implicit
conversions defined in its companion object. It is not very important to know
what these implicit conversions are. Just make sure you import both SparkContext
and SparkContext._.
When we worked in the Spark shell, SparkContext and
SparkContext._ were imported automatically for us.
The third line imports SparkConf, which is a wrapper class that contains the
configuration settings of a Spark application, such as its application name, the
memory size of each executor, and the address of the master or cluster manager.
Next, we have imported some RDD and GraphX-specific class constructors and
operators with these lines:
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._
The underscore after org.apache.spark.graphx makes sure that all public APIs in
GraphX get imported.
Within main, we had to first configure the Spark program. To do this, we created
an object called SparkConf and set the application settings through a chain of setter
methods on the SparkConf object. SparkConf provides specific setters for some
common properties, such as the application name or master. Alternatively, the
generic set method can be used to set any configuration property as a key-value
pair, and setAll accepts a whole sequence of such pairs. The most common
configuration parameters are listed
in the following table with their default values and usage. The extensive list can be
found at https://spark.apache.org/docs/latest/configuration.html:
Spark property name | Default value | Usage
spark.app.name | (none) | The name of your application
spark.master | (none) | The cluster manager to connect to, for example, local, local[4], or spark://host:port
spark.executor.memory | 512m | The amount of memory to use per executor process
spark.driver.memory | 512m | The amount of memory to use for the driver process
spark.storage.memoryFraction | 0.6 | The fraction of the Java heap to use for Spark's in-memory cache of RDDs
spark.serializer | org.apache.spark.serializer.JavaSerializer | The class used to serialize objects that are sent over the network or cached
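As an illustration of these generic setters, here is a sketch that is not the configuration used in this chapter; the property values are arbitrary examples:
import org.apache.spark.SparkConf
val exampleConf = new SparkConf()
  .setAppName("Tiny Social")
  .setMaster("local[2]")
  .setAll(Seq(
    "spark.executor.memory" -> "1g",
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer"))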
Coming back to our program, we set the name of our application to "Tiny Social"
and the master to be the local machine on which we submit the application. In
addition, the driver memory is set to 2 GB. Had we set the master to be a cluster
instead of local, we would specify the memory per executor by setting
spark.executor.memory instead of spark.driver.memory.
In principle, the driver and executor have different roles and, in
general, they run on different processes except when we set the master
to be local. The driver is the process that compiles our program into
tasks, schedules these tasks on one or more executors, and maintains
the physical location of every RDD. Each executor is responsible for
executing the tasks, and storing and caching RDDs in memory.
It is not mandatory to set the Spark application settings in the SparkConf object
inside your program. Alternatively, when submitting our application, we could
set these parameters as command-line options of the spark-submit tool. We will
cover that part in detail in the following sections. In this case, we will just create our
SparkContext object as:
val sc = new SparkContext(new SparkConf())
After configuring the program, the next task is to load the data that we want to
process by calling utility methods such as sc.textFile on the SparkContext
object sc:
val people = sc.textFile("./data/people.csv")
val links = sc.textFile("./data/links.csv")
Finally, the rest of the program consists of the same operations on RDDs and
graphs that we have used in the shell.
To avoid confusion when passing a relative file path to I/O actions
such as sc.textFile(), the convention used in this book is
that the current directory of the command line is always set to the
project root folder. For instance, if our Tiny Social app's root folder
is $SPARKHOME/exercises/chap1, then Spark will look for the
data to be loaded in $SPARKHOME/exercises/chap1/data. This
assumes that we have put the files in that data folder.
Once we have SBT properly installed, we can use it to build our application with
all its dependencies inside a single JAR package file, also called an uber JAR. In fact,
when running a Spark application on several worker machines, an error will
occur if some machines do not have the required dependency JARs.
By packaging an uber jar with SBT, the application code and its dependencies are
all distributed to the workers. Concretely, we need to create a build definition file
in which we set the project settings. Moreover, we must specify the dependencies
and the resolvers that help SBT find the packages that are needed by our program.
A resolver indicates the name and location of the repository that has the required
JAR file. Let's create a file called build.sbt in the project root folder $SPARKHOME/
exercises/chap1 and insert the following lines:
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.1",
  "org.apache.spark" %% "spark-graphx" % "1.4.1"
)

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
By convention, the settings are defined by Scala expressions and they need to be
delimited by blank lines. Here, we set the project name, its version number, the
Scala version, and the Spark library dependencies. To build the program,
we then enter the command:
$ sbt package
The required options for the spark-submit command are the Scala application
object name and the JAR file that we previously built with SBT. If we did not
set the master when creating the SparkConf object, we would also have to
specify the --master option to spark-submit.
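For example, a sketch of the submit command for our application would look like the following; the JAR path follows from SBT's default output location, given the project name and Scala version set in build.sbt:
$SPARKHOME/bin/spark-submit \
  --class SimpleGraphApp \
  ./target/scala-2.10/simple-project_2.10-1.0.jar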
You can list all the available options for the spark-submit script
with the command:
spark-submit --help
Summary
In this chapter, we took a whirlwind tour of graph processing in Spark. Specifically,
we installed the Java Development Kit, a prebuilt version of Spark, and the SBT tool.
Furthermore, you were introduced to graph abstractions and operations in Spark by
creating a social network in the Spark shell and also in a standalone program.
In the next chapter, you will learn more about how to build and explore graphs
in Spark.