MODULE 5
APACHE PIG AND APACHE SPARK
Pig
Apache Pig is a high-level platform for processing large datasets, built on top of Apache
Hadoop.
Pig is made up of two pieces:
1. The language used to express data flows, called Pig Latin.
2. The execution environment to run Pig Latin Programs.
There are currently two environments: Local execution in a single JVM and distributed
execution on a Hadoop cluster.
Pig Latin is a data flow language: it is used to define a sequence of operations, such as loading, transforming, filtering, grouping, and storing, that are applied to the input data to produce output.
Key features of Pig:
1. Ease of use: Pig Latin is similar to SQL, making it accessible for those familiar with
relational databases.
2. Data Flow: it allows users to express data transformations and analyses in a data flow
manner.
3. Extensibility: Users can write custom functions (UDFs) in Java, Python or JavaScript.
4. Optimization opportunities: Pig optimizes the execution of tasks automatically, allowing
for better performance.
Use Cases:
ETL (Extract, Transform, Load).
Data analysis and reporting.
Batch Processing of Large datasets.
1) Installing and Running Pig
Pig runs as a client-side application.
Even if you want to run Pig on a Hadoop cluster, there is nothing extra to install on the
cluster.
Pig launches jobs and interacts with HDFS (or other Hadoop file systems) from your
workstation.
Try typing pig -help to get usage instructions.
2) Execution Types
Pig has two execution types or modes: local mode and MapReduce mode.
Local mode
In local mode, Pig runs in a single JVM and accesses the local file system. This mode is
suitable only for small datasets and when trying out Pig.
The execution type is set using the -x or -exectype option. To run in local mode, set the
option to local:
% pig -x local
grunt>
MapReduce mode
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop
cluster.
To use MapReduce mode, you first need to check that the version of Pig you downloaded is
compatible with the version of Hadoop you are using.
3) Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:
Script: Pig can run a script file that contains Pig commands.
Grunt: Grunt is an interactive shell for running Pig commands. Grunt is started when no file
is specified for Pig to run, and the -e option is not used. It is also possible to run Pig scripts
from within Grunt using run and exec.
Embedded: You can run Pig programs from Java using the PigServer class. For
programmatic access to Grunt, use PigRunner.
Grunt
Grunt is Pig's interactive shell. It provides line-editing facilities (command history and editing, as in GNU Readline) and completes Pig Latin keywords and function names when you press the Tab key.
Grunt is primarily used to run Pig Latin statements interactively and to issue routine HDFS and job-management commands.
Pig Latin Editors
PigPen is an Eclipse plug-in that provides an environment for developing Pig programs; it
includes a Pig script text editor.
Comparison with Databases
Pig Latin
Pig Latin is a data flow language used by Apache Pig to analyse data in Hadoop.
Apache Pig includes many built-in operators to support data operations such as joins, filters,
and sorting. Furthermore, it adds nested data types, such as tuples, bags, and maps, that
MapReduce lacks.
Structure: A Pig Latin program consists of a collection of statements. A statement can be
thought of as an operation or a command. For example, a GROUP operation is a type of
statement:
grouped_records = GROUP records BY year;
The command to list the files in a Hadoop filesystem is another example of a statement:
ls /
Statements are usually terminated with a semicolon. Pig Latin supports single-line comments
introduced by a double hyphen (--), and C-style comments, which are more flexible since
they delimit the beginning and end of the comment block with /* and */ markers:
/*
* Description of my program spanning
* multiple lines.
*/
Statements: As a Pig Latin program is executed, each statement is parsed in turn. If there are
syntax errors, or other (semantic) problems such as undefined aliases, the interpreter will halt
and display an error message.
The interpreter builds a logical plan for every relational operation. When the Pig Latin
interpreter sees the first line containing a LOAD statement, it confirms that it is
syntactically and semantically correct, and adds it to the logical plan.
Expressions
Pig has a rich variety of expressions, many of which will be familiar from other programming
languages.
Types
Pig has four numeric types: int, long, float, and double.
The numeric, textual, and binary types are simple atomic types.
Pig Latin also has three complex types for representing nested structures: tuple, bag, and
map.
Functions
Functions in Pig come in four types:
i) Eval function: A function that takes one or more expressions and returns another
expression. An example of a built-in eval function is MAX, which returns the
maximum value of the entries in a bag.
ii) Filter function: A special type of eval function that returns a logical boolean
result. As the name suggests, filter functions are used in the FILTER operator to
remove unwanted rows. An example of a built-in filter function is IsEmpty, which
tests whether a bag or a map contains any items.
iii) Load function: A function that specifies how to load data into a relation from
external storage.
iv) Store function: A function that specifies how to save the contents of a relation to
external storage. Often, load and store functions are implemented by the same
type. For example, PigStorage, which loads data from delimited text files, can
store data in the same format.
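As noted earlier, custom functions (UDFs) can be written in Python. The following is a minimal, hedged sketch of what a Python (Jython) eval UDF might look like; the file name udfs.py, the function name to_upper, and the field names are illustrative assumptions, and the outputSchema decorator is assumed to be the one provided by Pig's Jython UDF support.
# Hypothetical udfs.py -- a sketch of a Python eval UDF for Pig.
# In Grunt it would be registered and used roughly like:
#   REGISTER 'udfs.py' USING jython AS myfuncs;
#   upper_names = FOREACH records GENERATE myfuncs.to_upper(name);
@outputSchema("upper_name:chararray")  # declare the return schema to Pig
def to_upper(value):
    # Eval function: return the upper-cased form of a chararray field.
    if value is None:
        return None
    return value.upper()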
Data Processing Operators
1. Loading and Storing Data: Here’s an example of using PigStorage to store tuples as
plain-text values separated by a colon character:
grunt> STORE A INTO 'out' USING PigStorage(':');
grunt> cat out
Joe:cherry:2
Ali:apple:3
2. Grouping and Joining Data: Joining datasets in MapReduce takes some work on the part
of the programmer whereas Pig has very good built-in support for join operations, making it
much more approachable.
JOIN
Let’s look at an example of an inner join. Consider the relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
3. Sorting Data: Relations are unordered in Pig. Consider a relation A:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
There is no guarantee which order the rows will be processed in. In particular, when
retrieving the contents of A using DUMP or STORE, the rows may be written in any order. If
you want to impose an order on the output, you can use the ORDER operator to sort a
relation by one or more fields.
The following example sorts A by the first field in ascending order and by the second
field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)
4. Combining and Splitting Data: Sometimes you have several relations that you would like
to combine into one. For this,the UNION statement is used. For example:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> DUMP B;
(z,x,8)
(w,y,1)
grunt> C = UNION A, B;
grunt> DUMP C;
(2,3)
(1,2)
(2,4)
(z,x,8)
(w,y,1)
C is the union of relations A and B, and since relations are unordered, the order of the
tuples in C is undefined.
APACHE SPARK
Apache Spark
Apache Spark is an open-source, distributed computing framework that provides a fast and
efficient solution for large scale data processing.
Spark is designed to handle big data processing tasks by utilizing a cluster of computers,
reducing processing time compared to traditional data processing frameworks.
It can handle up to petabytes (that’s millions of gigabytes!) of data, and manage up to
thousands of physical or virtual machines.
Why Spark?
Speed: It uses an in-memory data storage mechanism, which enables faster processing of
data compared to traditional disk-based data processing frameworks like Hadoop
MapReduce.
Scalability: Spark can be easily scaled to handle growing data volume and processing
demands by adding more nodes to the cluster.
Flexibility: Spark provides high-level APIs in multiple programming languages, including
Scala, Java, Python, and R, which makes it easier for developers to work with Spark,
regardless of their preferred programming language.
Ease of Use: Spark provides an intuitive and easy-to-use API that abstracts away the
complex details of distributed computing, making it easier for developers to build and
maintain data processing pipelines.
Integration: Spark integrates with popular big data tools, such as Hadoop, Hive, and HBase,
making it a popular choice for large scale data processing in the big data ecosystem.
Resilience: Spark uses the Resilient Distributed Datasets (RDD) concept, which provides
fault tolerance and the ability to recover from node failures.
Spark application
A Spark application is a program built with the Spark APIs that runs in a Spark-compatible
cluster/environment.
It can be a PySpark script, a Java application, a Scala application, a SparkSession started
by the spark-shell or spark-sql command, an AWS EMR step, etc.
A Spark application consists of a driver and one or more executors.
Example of Spark in Python:
This script (a sketch of which is shown after this list) will perform these processes:
1. Create a DataFrame named df1 by reading data from HDFS.
2. Create another DataFrame named df2 from another HDFS path.
3. Create a new DataFrame named df using an inner join between df1 and df2, and then print it out.
4. Finally, save df into HDFS using the Parquet format.
5. Run the application using the following command:
spark-submit spark-basic.py
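The notes describe the script without listing it, so the following is a minimal sketch of what spark-basic.py could look like. The HDFS paths, the CSV format with a header, and the join column customer_id are illustrative assumptions, not part of the original example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-basic").getOrCreate()

# 1. Read the first dataset from HDFS into df1 (CSV with a header is assumed).
df1 = spark.read.option("header", "true").csv("hdfs:///data/customers.csv")

# 2. Read the second dataset from another HDFS path into df2.
df2 = spark.read.option("header", "true").csv("hdfs:///data/orders.csv")

# 3. Inner-join the two DataFrames on an assumed common column and print the result.
df = df1.join(df2, on="customer_id", how="inner")
df.show()

# 4. Save the joined DataFrame to HDFS in Parquet format.
df.write.mode("overwrite").parquet("hdfs:///data/customer_orders.parquet")

spark.stop()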
Word Count Java example for Spark:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;
import java.util.Arrays;
public class SparkWordCount {
public static void main(String[] args) {
// Step 1: Set up Spark configuration and create a Spark context
SparkConf conf = new SparkConf().setAppName("Word Count").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
// Step 2: Load input data (a text file) into an RDD
String inputFile = "input.txt"; // Ensure input.txt exists in your file system
JavaRDD<String> inputData = sc.textFile(inputFile);
// Step 3: Split the lines into words (transformations)
JavaRDD<String> words = inputData.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
// Step 4: Map each word to a (word, 1) pair
JavaPairRDD<String, Integer> wordCounts = words.mapToPair(word -> new Tuple2<>(word, 1));
// Step 5: Reduce the pairs by key (word) to count the occurrences of each word
JavaPairRDD<String, Integer> counts = wordCounts.reduceByKey(Integer::sum);
// Step 6: Save the result to a text file
counts.saveAsTextFile("output"); // Output will be saved in "output" directory
// Step 7: Stop the Spark context
sc.stop();
}
}
Spark jobs and stages
A Spark job is a parallel computation of tasks. Each action operation creates one Spark job.
Each Spark job is converted to a DAG that includes one or more stages.
A Spark stage is a smaller set of tasks that depend on each other.
Stages are created for each job based on shuffle boundaries, i.e., which operations can be
performed serially or in parallel.
Not all Spark operations or actions can happen in a single stage without data shuffling,
so a job may be divided into multiple stages.
For example, an operation that involves data shuffling will lead to the creation of a new
stage.
If there is no data shuffling in the job, there is usually a single stage.
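As a rough illustration of the shuffle boundary, consider the short PySpark sketch below; it assumes an existing SparkSession named spark, as in the other examples in these notes, and the sample data is illustrative.
rdd = spark.sparkContext.parallelize(["a b", "b c", "a c"])
words = rdd.flatMap(lambda line: line.split(" "))    # narrow transformation: same stage
pairs = words.map(lambda w: (w, 1))                  # narrow transformation: same stage
counts = pairs.reduceByKey(lambda a, b: a + b)       # shuffle: starts a new stage
counts.collect()                                     # the action submits one job with two stages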
Spark tasks
A Spark task is a single unit of work or execution that runs in a Spark executor.
It is the parallelism unit in Spark. Each stage contains one or multiple tasks.
Each task is mapped to a single core and a partition in the dataset.
In the word count example above, each stage has only one task because the sample input
data is stored in a single small file in HDFS.
If you have a data input with 1000 partitions, then at least 1000 tasks will be created
for the operations.
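A small sketch relating partitions to tasks, assuming an existing SparkSession named spark (the partition count is illustrative):
rdd = spark.sparkContext.parallelize(range(1000), 8)   # explicitly create 8 partitions
print(rdd.getNumPartitions())                          # 8 partitions -> 8 tasks per stage for this RDD
rdd.map(lambda x: x * 2).count()                       # this job's single stage runs 8 tasks in parallel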
Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are a fundamental data structure in Apache Spark that
provide a fault-tolerant, distributed collection of objects.
Here’s an overview of RDDs, their characteristics, and some basic operations.
Creating RDDs
You can create RDDs in several ways:
1. From an existing collection:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
2. From a data file:
rdd = spark.sparkContext.textFile("path/to/data.txt")  # placeholder path
Basic Operations
Transformations:
Transformations create a new RDD from an existing one. Common transformations include:
1. Map: Applies a function to each element.
squared_rdd = rdd.map(lambda x: x * x)
2. Filter: Returns a new RDD containing only the elements that satisfy a condition.
even_rdd = rdd.filter(lambda x: x % 2 == 0)
3. FlatMap: Similar to map, but each input element can produce multiple output elements.
flat_mapped_rdd = rdd.flatMap(lambda x: (x, x + 1))
4. Union: Combines two RDDs
union_rdd = rdd.union(another_rdd)
Actions
Actions trigger the execution of transformations and return results. Common actions include:
1. Collect: Returns all elements of the RDD to the driver program.
result = squared_rdd.collect()
2. Count: Returns the number of elements in the RDD.
count = rdd.count()
3. First: Returns the first element of the RDD.
first_element = rdd.first()
4. Take: Returns the first n elements.
first_three = rdd.take(3)
Persistence in Spark
In Apache Spark, persistence (also known as caching) allows you to store intermediate
results of an RDD (Resilient Distributed Dataset) or DataFrame in memory (or disk) for
faster access in subsequent operations.
By default, Spark computes RDDs and DataFrames from scratch every time an action is
called.
However, if the same dataset is used multiple times in your program, you can persist it to
avoid recomputation, thereby improving performance.
Types of Persistence Levels:
Spark provides various persistence levels depending on where you want to store the data:
1. MEMORY_ONLY: Stores RDD or DataFrame in memory as deserialized Java
objects. If the data does not fit in memory, Spark will recompute it whenever
required.
2. MEMORY_AND_DISK: Stores data in memory as deserialized Java objects. If the
data does not fit in memory, it stores the rest on disk. When Spark needs the data, it
reads from the disk if necessary.
3. MEMORY_ONLY_SER: Similar to MEMORY_ONLY, but the data is stored in
serialized format (compressed), which reduces memory usage at the cost of increased
CPU time for serialization and deserialization.
4. MEMORY_AND_DISK_SER: Similar to MEMORY_AND_DISK, but the data is
stored in serialized format both in memory and on disk.
5. DISK_ONLY: Stores the data only on disk, which avoids memory consumption, but
accessing data from disk is slower.
6. OFF_HEAP: Similar to MEMORY_ONLY_SER, but stores the data in off-heap memory
(outside the Java Virtual Machine heap). This requires off-heap memory to be enabled.
Using cache() and persist()
cache(): Equivalent to persist(StorageLevel.MEMORY_ONLY). It stores the data in memory only.
persist(): Allows you to specify the storage level, so it is more flexible than cache(). (A short sketch of both calls follows below.)
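A short sketch of cache() and persist(), assuming an existing SparkSession named spark and illustrative data:
from pyspark import StorageLevel

rdd1 = spark.sparkContext.parallelize(range(1000))
rdd2 = spark.sparkContext.parallelize(range(1000))

rdd1.cache()                                  # equivalent to persist(StorageLevel.MEMORY_ONLY)
rdd2.persist(StorageLevel.MEMORY_AND_DISK)    # spill partitions to disk if they do not fit in memory

rdd1.count()       # the first action computes rdd1 and caches it
rdd1.count()       # later actions reuse the cached partitions instead of recomputing them

rdd2.unpersist()   # release the storage when the data is no longer needed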
Persistence is a key feature in Apache Spark that allows you to optimize your application by
storing intermediate results either in memory, disk, or a combination of both.
By using persist() and cache() intelligently, you can greatly improve the performance of
applications that reuse data across multiple actions or have expensive computations.
Serialization
Serialization in Apache Spark applies to both data and functions (closures) because Spark
operates in a distributed environment.
When data or functions are sent from the driver node to the worker nodes for distributed
processing, Spark serializes them, ensuring they can be transferred across the network and
executed in a distributed fashion.
Let’s break down how serialization of data and functions works in Spark:
1. Serialization of Data in Spark
Data in Spark is distributed across the nodes in the cluster. To enable distributed processing,
Spark serializes the data that needs to be transferred over the network or stored on disk. This
includes:
RDDs: Resilient Distributed Datasets (RDDs) are the core data structure of Spark.
When RDDs are created, transformed, or shuffled across nodes, their data gets
serialized.
DataFrames and Datasets: Similarly, when operations are performed on DataFrames
or Datasets, the underlying data may need to be serialized, especially during shuffling
or caching.
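A common tuning step related to data serialization is switching to the Kryo serializer; the following is a minimal sketch, assuming the standard spark.serializer configuration key and an otherwise default setup.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-demo")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))  # more compact than Java serialization
sc = SparkContext(conf=conf)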
2. Serialization of Functions in Spark
In Spark, functions (closures) need to be serialized because Spark runs these functions
on worker nodes, not on the driver.
When you write a function inside an RDD transformation (like map, filter, reduceByKey,
etc.), Spark must serialize this function (its closure) and send it to the worker nodes.
A closure in Spark refers to any variables, objects, or functions that are referenced within a
transformation or action, and must be serialized to run on the worker nodes.
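A small sketch of closure serialization, assuming an existing SparkSession named spark: the driver-side variable threshold is referenced inside the lambda, so Spark serializes it together with the function and ships both to the executors.
threshold = 10   # driver-side variable captured by the closure below

rdd = spark.sparkContext.parallelize(range(100))
filtered = rdd.filter(lambda x: x > threshold)   # threshold travels with the serialized closure
print(filtered.count())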
Broadcast variable:
When you have a small lookup table that you need to join with a larger dataset, it can be
more efficient to broadcast the lookup table to all the executors rather than sending it over
the network for every join.
Here is a simple example of using broadcast variables in Spark
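The example itself is not included in the notes, so the following is a minimal sketch; the lookup table and order data are illustrative assumptions, and an existing SparkSession named spark is assumed.
lookup = {"IN": "India", "US": "United States", "FR": "France"}   # small lookup table
bc_lookup = spark.sparkContext.broadcast(lookup)                  # shipped once to each executor

orders = spark.sparkContext.parallelize([("o1", "IN"), ("o2", "US"), ("o3", "FR")])

# Each task reads the table locally from bc_lookup.value instead of it being sent for every join.
with_country = orders.map(lambda o: (o[0], bc_lookup.value.get(o[1], "unknown")))
print(with_country.collect())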
Anatomy of a Spark Job Run
The anatomy of a Spark job run involves multiple stages, from defining the job in the
application code to the execution of tasks across a distributed cluster.
At the highest level, there are two independent entities:
Driver: which hosts the application (Spark context) and schedules tasks for a job.
Executors: which are exclusive to the application, run for the duration of the
application, and execute the application’s tasks.
Job Submission:
The following steps illustrate how Spark runs a job.
A Spark job is submitted automatically when an action (such as count()) is performed
on an RDD.
Step 1: Internally, this causes runJob() to be called on the SparkContext, which passes the call
on to the scheduler that runs as part of the driver.
Step 2: The scheduler is made up of two parts: a DAG scheduler that breaks down the job into
a DAG of stages, and a task scheduler that is responsible for submitting the tasks from each
stage to the cluster.
DAG
What is a DAG?
A DAG (Directed Acyclic Graph) is an acyclic graph of stages that Spark builds as soon as you apply transformations to RDDs.
From the DAG, Spark creates a physical plan that shows how the applied transformations are grouped into stages.
DAG Construction:
Tasks come in two types: shuffle map tasks and result tasks.
Shuffle map tasks:
As the name suggests, shuffle map tasks are like the map-side part of the shuffle in
MapReduce.
Each shuffle map task runs a computation on one RDD partition and, based on a partitioning
function, writes its output to a new set of partitions, which are fetched by a later stage.
Shuffle map tasks run in all stages except the final stage.
Result tasks:
Result tasks run in the final stage that returns the result to the user’s program.
How It Works
Spark translates the RDD transformations into a DAG (Directed Acyclic Graph) and then
starts the execution.
In Spark, a job is divided into multiple stages, and each stage consists of a series of
transformations on the data. The transformations in a stage are executed in parallel, and the
stages are executed in a specific order, as determined by the dependencies between the stages.
At high level, when any action is called on the RDD, Spark creates the DAG and submits to
the DAG scheduler.
The DAG scheduler divides each stage into a number of tasks; a stage is comprised of
tasks based on the partitions of the input data. The DAG scheduler also pipelines operators
together: for example, many map operators can be scheduled in a single stage. The final result
of the DAG scheduler is a set of stages.
The stages are passed on to the Task Scheduler. The Task Scheduler launches tasks via the
cluster manager (Spark Standalone/YARN/Mesos). The Task Scheduler doesn’t know about
the dependencies between stages.
The worker nodes execute the tasks.
Task Scheduling
When the task scheduler receives a set of tasks, it uses its list of executors that are
running for the application and constructs a mapping of tasks to executors on the basis of
placement preference.
For a given executor, the scheduler first assigns process-local tasks, then node-local
tasks, then rack-local tasks, before assigning any arbitrary (nonlocal) task.
Executors send status updates to the driver when a task has completed or failed. In case
of task failure, the task scheduler resubmits the task on another executor.
It also launches speculative tasks for tasks that are running slowly, if this feature is
enabled.
Speculative tasks are duplicates of existing tasks, which the scheduler may run as a
backup if a task is running more slowly than expected.
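Speculative execution is switched on through configuration; the following is a minimal sketch, assuming the standard spark.speculation key and an otherwise default setup.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("speculation-demo")
        .set("spark.speculation", "true"))   # re-launch slow-running tasks as backup copies
sc = SparkContext(conf=conf)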
Task Execution
The executor first makes sure that the JAR and file dependencies for the task are up to date.
It keeps a local cache of the dependencies used by previous tasks.
It deserializes the task code (which consists of the user's functions) from the serialized
bytes that were sent as part of the launch task message.
Finally, the task code is executed.
The task can return a result to the driver.
The result is serialized and sent to the executor backend, and then to the driver as a status
update message.
Executors and Cluster Managers
Executors in Spark
An executor is a distributed process launched for a Spark application on a worker node in a
cluster. Executors are responsible for executing tasks and storing data for the Spark
application.
Cluster Managers in Spark
The cluster manager is responsible for managing resources and scheduling tasks across a
cluster.
It provides Spark with access to the computing resources (i.e., CPUs, memory) needed to run
Spark applications.
Spark supports multiple cluster managers, including:
1. Local
2. Standalone
3. Mesos
4. YARN
1. Local
In local mode there is a single executor running in the same JVM as the driver.
This mode is useful for testing or running small jobs.
The master URL for this mode is local (use one thread), local[n] (use n threads), or
local[*] (use one thread per available core).
2. Standalone Cluster Manager
In standalone mode, one of the nodes in the cluster acts as a master, and the rest act
as worker nodes.
The master coordinates the resources and schedules tasks on the workers.
The master URL is spark://host:port.
Suitable for small to medium-sized clusters.
3. Mesos
Apache Mesos is a cluster manager that can dynamically allocate resources across
multiple applications, not just Spark.
It is highly scalable and can handle clusters of thousands of nodes.
The master URL is mesos://host:port.
Spark can run on Mesos in either coarse-grained or fine-grained mode:
o Coarse-grained mode: Allocates a fixed number of resources to Spark for the
entire application runtime.
o Fine-grained mode: Dynamically allocates and deallocates resources during
the runtime, offering more flexibility but with some overhead.
4. YARN (Yet Another Resource Negotiator)
YARN is the resource manager used in Hadoop.
Each running Spark application corresponds to an instance of a YARN
application, and each executor runs in its own YARN container.
The master URL is yarn-client or yarn-cluster.
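The master URL can also be set programmatically when building the SparkSession; a minimal sketch follows (host names and ports are hypothetical, and only one master() line would be active at a time).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("master-url-demo")
         .master("local[*]")               # local mode: one thread per core
         # .master("spark://host:7077")    # standalone cluster manager (hypothetical host)
         # .master("mesos://host:5050")    # Mesos (hypothetical host)
         # .master("yarn-client")          # YARN, as described in these notes
         .getOrCreate())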
Spark on YARN
Running Spark on YARN provides the tightest integration with other Hadoop
components and is the most convenient way to use Spark when you have an existing
Hadoop Cluster.
Spark offers two deploy modes for running on YARN:
YARN Client mode, where the driver runs in the client.
YARN Cluster mode, where the driver runs on the cluster in the YARN application
master.
YARN Modes
1. YARN Client Mode: This mode is required for programs that have any interactive
component, such as spark-shell or pyspark.
Client mode is also useful when building Spark programs, since any debugging output is
immediately visible.
2. YARN Cluster Mode: YARN cluster mode, on the other hand, is appropriate for
production jobs, since the entire application runs on the cluster, which makes it much easier
to retain log files.
YARN client mode
Step 1: In YARN client mode, the interaction with YARN starts when a new SparkContext
instance is constructed by the driver program.
Step 2: The context submits a YARN application to the YARN resource manager.
Step 3: The YARN resource manager starts a YARN container on a node manager in the
cluster and runs a Spark ExecutorLauncher application master in it.
Step 4: The job of the ExecutorLauncher is to start executors in YARN containers, which it
does by requesting resources from the resource manager.
Step 5: The ExecutorLauncher then launches ExecutorBackend processes as the containers are allocated to it.
As each executor starts, it connects back to the SparkContext and registers itself.
This gives the SparkContext information about the number of executors
available for running tasks and their locations, which is used for making task placement
decisions.
The number of executors that are launched is set in spark-shell, spark-submit, or pyspark,
along with the number of cores and the amount of memory that each executor uses.
An example showing how to run spark-shell on YARN with four executors, each using
one core and 2 GB of memory:
% spark-shell --master yarn-client \
--num-executors 4 \
--executor-cores 1 \
--executor-memory 2g
YARN Cluster mode
In YARN cluster mode, the user’s driver program runs in a YARN application master
process.
The spark-submit command is used with a master URL of yarn-cluster:
% spark-submit --master yarn-cluster …
All other parameters, like --num-executors and the application JAR, are the same as for
YARN client mode.
Step 1: The spark-submit client launches the YARN application, but it doesn’t run any user
code.
Step 2: The client submits a YARN application to the YARN resource manager.
Step 3a: The YARN resource manager starts a YARN container on a node manager in the
cluster, and the application master starts the driver program (step 3b) before allocating
resources for executors (step 4).
In both YARN modes, the executors are launched before there is any data locality
information available, so they may not end up being co-located with the datanodes hosting
the files that the jobs access.