MODULE 5
APACHE PIG AND APACHE SPARK
Pig
Apache Pig is a high-level platform for processing large datasets, built on top of Apache
Hadoop.
Pig is made up of two pieces:
1. The language used to express data flows, called Pig Latin.
2. The execution environment to run Pig Latin Programs.
There are currently two environments: Local execution in a single JVM and distributed
execution on a Hadoop cluster.
Pig Latin is a data flow language: it is used to define a sequence of operations, such as loading, transforming, filtering, grouping, and storing, that are applied to the input data to produce output.
Key features of Pig:
1. Ease of use: Pig Latin is similar to SQL, making it accessible for those familiar with
relational databases.
2. Data Flow: it allows users to express data transformations and analyses in a data flow
manner.
3. Extensibility: Users can write custom functions (UDFs) in Java, Python or JavaScript.
4. Optimization opportunities: Pig optimizes the execution of tasks automatically, allowing
for better performance.
Use Cases:
ETL (Extract, Transform, Load).
Data analysis and reporting.
Batch Processing of Large datasets.
1) Installing and Running Pig
Pig runs as a client-side application.
Even if you want to run Pig on a Hadoop cluster, there is nothing extra to install on the
cluster.
Pig launches jobs and interacts with HDFS (or other Hadoop file systems) from your
workstation.
Try typing pig -help to get usage instructions.
2) Execution Types
Pig has two execution types or modes: local mode and MapReduce mode.
Local mode
In local mode, Pig runs in a single JVM and accesses the local file system. This mode is
suitable only for small datasets and when trying out Pig.
The execution type is set using the -x or -exectype option. To run in local mode, set the
option to local:
% pig -x local
grunt>
MapReduce mode
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop
cluster.
To use MapReduce mode, you first need to check that the version of Pig you downloaded is
compatible with the version of Hadoop you are using.
3) Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:
Script: Pig can run a script file that contains Pig commands.
Grunt: Grunt is an interactive shell for running Pig commands. Grunt is started when no file
is specified for Pig to run, and the -e option is not used. It is also possible to run Pig scripts
from within Grunt using run and exec.
Embedded: You can run Pig programs from Java using the PigServer class. For
programmatic access to Grunt, use PigRunner.
Grunt
Grunt is Pig's interactive shell. It provides line-editing facilities (command history and editing, as in GNU Readline) and completes Pig Latin keywords and function names when you press the Tab key.
Grunt is primarily used to run Pig Latin statements interactively and to issue routine HDFS and job-management commands.
Pig Latin Editors
PigPen is an Eclipse plug-in that provides an environment for developing Pig programs; it
includes a Pig script text editor.
Comparison with Databases
Pig Latin
Pig Latin is a data flow language used by Apache Pig to analyse data in Hadoop.
Apache Pig includes many built-in operators to support data operations such as joins, filters,
and sorting. Furthermore, it adds nested data types, such as tuples, bags, and maps, that
MapReduce lacks.
Structure: A Pig Latin program consists of a collection of statements. A statement can be
thought of as an operation or a command. For example, a GROUP operation is a type of
statement:
grouped_records = GROUP records BY year;
The command to list the files in a Hadoop filesystem is another example of a statement:
ls /
Statements are usually terminated with a semicolon. Pig Latin supports single-line comments
introduced by a double hyphen (--), and C-style comments, which are more flexible since
they delimit the beginning and end of the comment block with /* and */ markers:
/*
* Description of my program spanning
* multiple lines.
*/
Statements: As a Pig Latin program is executed, each statement is parsed in turn. If there are
syntax errors, or other (semantic) problems such as undefined aliases, the interpreter will halt
and display an error message.
The interpreter builds a logical plan for every relational operation. When the Pig Latin
interpreter sees the first line containing a LOAD statement, it confirms that it is
syntactically and semantically correct, and adds it to the logical plan.
Expressions
Pig has a rich variety of expressions, many of which will be familiar from other programming
languages.
Types
Pig has four numeric types: int, long, float, and double.
The numeric, textual, and binary types are simple atomic types.
Pig Latin also has three complex types for representing nested structures: tuple, bag, and
map.
Functions
Functions in Pig come in four types:
i) Eval function: A function that takes one or more expressions and returns another
expression. An example of a built-in eval function is MAX, which returns the
maximum value of the entries in a bag.
ii) Filter function: A special type of eval function that returns a logical boolean
result. As the name suggests, filter functions are used in the FILTER operator to
remove unwanted rows. An example of a built-in filter function is IsEmpty, which
tests whether a bag or a map contains any items.
iii) Load function: A function that specifies how to load data into a relation from
external storage.
iv) Store function: A function that specifies how to save the contents of a relation to
external storage. Often, load and store functions are implemented by the same
type. For example, PigStorage, which loads data from delimited text files, can
store data in the same format.
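As noted earlier, custom functions (UDFs) can be written in Python. The following is a minimal, hedged sketch of what a Python (Jython) eval UDF might look like; the file name udfs.py, the function name to_upper, and the field names are illustrative assumptions, and the outputSchema decorator is assumed to be the one provided by Pig's Jython UDF support.
# Hypothetical udfs.py -- a sketch of a Python eval UDF for Pig.
# In Grunt it would be registered and used roughly like:
#   REGISTER 'udfs.py' USING jython AS myfuncs;
#   upper_names = FOREACH records GENERATE myfuncs.to_upper(name);
@outputSchema("upper_name:chararray")  # declare the return schema to Pig
def to_upper(value):
    # Eval function: return the upper-cased form of a chararray field.
    if value is None:
        return None
    return value.upper()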
Data Processing Operators
1. Loading and Storing Data: Here’s an example of using PigStorage to store tuples as
plain-text values separated by a colon character:
grunt> STORE A INTO 'out' USING PigStorage(':');
grunt> cat out
Joe:cherry:2
Ali:apple:3
2. Grouping and Joining Data: Joining datasets in MapReduce takes some work on the part
of the programmer whereas Pig has very good built-in support for join operations, making it
much more approachable.
JOIN
Let’s look at an example of an inner join. Consider the relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
3. Sorting Data: Relations are unordered in Pig. Consider a relation A:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
There is no guarantee which order the rows will be processed in. In particular, when
retrieving the contents of A using DUMP or STORE, the rows may be written in any order. If
you want to impose an order on the output, you can use the ORDER operator to sort a
relation by one or more fields.
The following example sorts A by the first field in ascending order and by the second
field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)
4. Combining and Splitting Data: Sometimes you have several relations that you would like
to combine into one. For this,the UNION statement is used. For example:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> DUMP B;
(z,x,8)
(w,y,1)
grunt> C = UNION A, B;
grunt> DUMP C;
(2,3)
(1,2)
(2,4)
(z,x,8)
(w,y,1)
C is the union of relations A and B, and since relations are unordered, the order of the
tuples in C is undefined.
APACHE SPARK
Apache Spark
Apache Spark is an open-source, distributed computing framework that provides a fast and
efficient solution for large scale data processing.
Spark is designed to handle big data processing tasks by utilizing a cluster of computers,
reducing processing time compared to traditional data processing frameworks.
It can handle up to petabytes (that’s millions of gigabytes!) of data, and manage up to
thousands of physical or virtual machines.
Why Spark?
Speed: It uses an in-memory data storage mechanism, which enables faster processing of
data compared to traditional disk-based data processing frameworks like Hadoop
MapReduce.
Scalability: Spark can be easily scaled to handle growing data volume and processing
demands by adding more nodes to the cluster.
Flexibility: Spark provides high-level APIs in multiple programming languages, including
Scala, Java, Python, and R, which makes it easier for developers to work with Spark,
regardless of their preferred programming language.
Ease of Use: Spark provides an intuitive and easy-to-use API that abstracts away the
complex details of distributed computing, making it easier for developers to build and
maintain data processing pipelines.
Integration: Spark integrates with popular big data tools, such as Hadoop, Hive, and HBase,
making it a popular choice for large scale data processing in the big data ecosystem.
Resilience: Spark uses the Resilient Distributed Datasets (RDD) concept, which provides
fault tolerance and the ability to recover from node failures.
Spark application
A Spark application is a program built with the Spark APIs that runs in a Spark-compatible
cluster/environment.
It can be a PySpark script, a Java application, a Scala application, a SparkSession started
by the spark-shell or spark-sql command, an AWS EMR step, etc.
A Spark application consists of a driver and one or more executors.
Example of Spark in Python:
This script (a sketch of which is shown after this list) will perform these processes:
1. Create a DataFrame named df1 by reading data from HDFS.
2. Create another DataFrame named df2 from another HDFS path.
3. Create a new DataFrame named df using an inner join between df1 and df2, and then print it out.
4. Finally, save df into HDFS using the Parquet format.
5. Run the application using the following command:
spark-submit spark-basic.py
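The notes describe the script without listing it, so the following is a minimal sketch of what spark-basic.py could look like. The HDFS paths, the CSV format with a header, and the join column customer_id are illustrative assumptions, not part of the original example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-basic").getOrCreate()

# 1. Read the first dataset from HDFS into df1 (CSV with a header is assumed).
df1 = spark.read.option("header", "true").csv("hdfs:///data/customers.csv")

# 2. Read the second dataset from another HDFS path into df2.
df2 = spark.read.option("header", "true").csv("hdfs:///data/orders.csv")

# 3. Inner-join the two DataFrames on an assumed common column and print the result.
df = df1.join(df2, on="customer_id", how="inner")
df.show()

# 4. Save the joined DataFrame to HDFS in Parquet format.
df.write.mode("overwrite").parquet("hdfs:///data/customer_orders.parquet")

spark.stop()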
Word Count Java example for Spark:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;
import java.util.Arrays;
public class SparkWordCount {
public static void main(String[] args) {
// Step 1: Set up Spark configuration and create a Spark context
SparkConf conf = new SparkConf().setAppName("Word Count").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
// Step 2: Load input data (a text file) into an RDD
String inputFile = "input.txt"; // Ensure input.txt exists in your file system
JavaRDD<String> inputData = sc.textFile(inputFile);
// Step 3: Split the lines into words (transformations)
JavaRDD<String> words = inputData.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
// Step 4: Map each word to a (word, 1) pair
JavaPairRDD<String, Integer> wordCounts = words.mapToPair(word -> new Tuple2<>(word, 1));
// Step 5: Reduce the pairs by key (word) to count the occurrences of each word
JavaPairRDD<String, Integer> counts = wordCounts.reduceByKey(Integer::sum);
// Step 6: Save the result to a text file
counts.saveAsTextFile("output"); // Output will be saved in "output" directory
// Step 7: Stop the Spark context
sc.stop();
}
}
Spark jobs and stages
A Spark job is a parallel computation of tasks. Each action operation creates one Spark job.
Each Spark job is converted to a DAG that includes one or more stages.
A Spark stage is a smaller set of tasks that depend on each other.
Stages are created for each job based on shuffle boundaries, i.e., which operations can be
performed serially or in parallel.
Not all Spark operations or actions can happen in a single stage without data shuffling,
so a job may be divided into multiple stages.
For example, an operation that involves data shuffling will lead to the creation of a new
stage.
If there is no data shuffling in the job, there is usually a single stage.
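As a rough illustration of the shuffle boundary, consider the short PySpark sketch below; it assumes an existing SparkSession named spark, as in the other examples in these notes, and the sample data is illustrative.
rdd = spark.sparkContext.parallelize(["a b", "b c", "a c"])
words = rdd.flatMap(lambda line: line.split(" "))    # narrow transformation: same stage
pairs = words.map(lambda w: (w, 1))                  # narrow transformation: same stage
counts = pairs.reduceByKey(lambda a, b: a + b)       # shuffle: starts a new stage
counts.collect()                                     # the action submits one job with two stages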
Spark tasks
A Spark task is a single unit of work or execution that runs in a Spark executor.
It is the parallelism unit in Spark. Each stage contains one or multiple tasks.
Each task is mapped to a single core and a partition in the dataset.
In the word count example above, each stage has only one task because the sample input
data is stored in a single small file in HDFS.
If you have a data input with 1000 partitions, then at least 1000 tasks will be created
for the operations.
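A small sketch relating partitions to tasks, assuming an existing SparkSession named spark (the partition count is illustrative):
rdd = spark.sparkContext.parallelize(range(1000), 8)   # explicitly create 8 partitions
print(rdd.getNumPartitions())                          # 8 partitions -> 8 tasks per stage for this RDD
rdd.map(lambda x: x * 2).count()                       # this job's single stage runs 8 tasks in parallel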
Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are a fundamental data structure in Apache Spark that
provide a fault-tolerant, distributed collection of objects.
Here’s an overview of RDDs, their characteristics, and some basic operations.
Creating RDDs
You can create RDDs in several ways:
1. From an existing collection:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
2. From a data file:
rdd = spark.sparkContext.textFile("path/to/data.txt")  # placeholder path
Basic Operations
Transformations:
Transformations create a new RDD from an existing one. Common transformations include:
1. Map: Applies a function to each element.
squared_rdd = rdd.map(lambda x: x * x)
2. Filter: Returns a new RDD containing only the elements that satisfy a condition.
even_rdd = rdd.filter(lambda x: x % 2 == 0)
3. FlatMap: Similar to map, but each input element can produce multiple output elements.
flat_mapped_rdd = rdd.flatMap(lambda x: (x, x + 1))
4. Union: Combines two RDDs
union_rdd = rdd.union(another_rdd)
Actions
Actions trigger the execution of transformations and return results. Common actions include:
1. Collect: Returns all elements of the RDD to the driver program.
result = squared_rdd.collect()
2. Count: Returns the number of elements in the RDD.
count = rdd.count()
3. First: Returns the first element of the RDD.
first_element = rdd.first()
4. Take: Returns the first n elements.
first_three = rdd.take(3)
Persistence in Spark
In Apache Spark, persistence (also known as caching) allows you to store intermediate
results of an RDD (Resilient Distributed Dataset) or DataFrame in memory (or disk) for
faster access in subsequent operations.
By default, Spark computes RDDs and DataFrames from scratch every time an action is
called.
However, if the same dataset is used multiple times in your program, you can persist it to
avoid recomputation, thereby improving performance.
Types of Persistence Levels:
Spark provides various persistence levels depending on where you want to store the data:
1. MEMORY_ONLY: Stores RDD or DataFrame in memory as deserialized Java
objects. If the data does not fit in memory, Spark will recompute it whenever
required.
2. MEMORY_AND_DISK: Stores data in memory as deserialized Java objects. If the
data does not fit in memory, it stores the rest on disk. When Spark needs the data, it
reads from the disk if necessary.
3. MEMORY_ONLY_SER: Similar to MEMORY_ONLY, but the data is stored in
serialized format (compressed), which reduces memory usage at the cost of increased
CPU time for serialization and deserialization.
4. MEMORY_AND_DISK_SER: Similar to MEMORY_AND_DISK, but the data is
stored in serialized format both in memory and on disk.
5. DISK_ONLY: Stores the data only on disk, which avoids memory consumption, but
accessing data from disk is slower.
6. OFF_HEAP: Similar to MEMORY_ONLY_SER, but stores the data in off-heap memory
(outside the Java Virtual Machine heap). This requires off-heap memory to be enabled.
Using cache() and persist()
cache(): Equivalent to persist(StorageLevel.MEMORY_ONLY). It stores the data in memory only.
persist(): Allows you to specify the storage level, so it is more flexible than cache(). (A short sketch of both calls follows below.)
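A short sketch of cache() and persist(), assuming an existing SparkSession named spark and illustrative data:
from pyspark import StorageLevel

rdd1 = spark.sparkContext.parallelize(range(1000))
rdd2 = spark.sparkContext.parallelize(range(1000))

rdd1.cache()                                  # equivalent to persist(StorageLevel.MEMORY_ONLY)
rdd2.persist(StorageLevel.MEMORY_AND_DISK)    # spill partitions to disk if they do not fit in memory

rdd1.count()       # the first action computes rdd1 and caches it
rdd1.count()       # later actions reuse the cached partitions instead of recomputing them

rdd2.unpersist()   # release the storage when the data is no longer needed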
Persistence is a key feature in Apache Spark that allows you to optimize your application by
storing intermediate results either in memory, disk, or a combination of both.
By using persist() and cache() intelligently, you can greatly improve the performance of
applications that reuse data across multiple actions or have expensive computations.
Serialization
Serialization in Apache Spark applies to both data and functions (closures) because Spark
operates in a distributed environment.
When data or functions are sent from the driver node to the worker nodes for distributed
processing, Spark serializes them, ensuring they can be transferred across the network and
executed in a distributed fashion.
Let’s break down how serialization of data and functions works in Spark:
1. Serialization of Data in Spark
Data in Spark is distributed across the nodes in the cluster. To enable distributed processing,
Spark serializes the data that needs to be transferred over the network or stored on disk. This
includes:
RDDs: Resilient Distributed Datasets (RDDs) are the core data structure of Spark.
When RDDs are created, transformed, or shuffled across nodes, their data gets
serialized.
DataFrames and Datasets: Similarly, when operations are performed on DataFrames
or Datasets, the underlying data may need to be serialized, especially during shuffling
or caching.
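A common tuning step related to data serialization is switching to the Kryo serializer; the following is a minimal sketch, assuming the standard spark.serializer configuration key and an otherwise default setup.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-demo")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))  # more compact than Java serialization
sc = SparkContext(conf=conf)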
2. Serialization of Functions in Spark
In Spark, functions (closures) need to be serialized because Spark runs these functions
on worker nodes, not on the driver.
When you write a function inside an RDD transformation (like map, filter, reduceByKey,
etc.), Spark must serialize this function (its closure) and send it to the worker nodes.
A closure in Spark refers to any variables, objects, or functions that are referenced within a
transformation or action, and must be serialized to run on the worker nodes.
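A small sketch of closure serialization, assuming an existing SparkSession named spark: the driver-side variable threshold is referenced inside the lambda, so Spark serializes it together with the function and ships both to the executors.
threshold = 10   # driver-side variable captured by the closure below

rdd = spark.sparkContext.parallelize(range(100))
filtered = rdd.filter(lambda x: x > threshold)   # threshold travels with the serialized closure
print(filtered.count())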
Broadcast variable:
When you have a small lookup table that you need to join with a larger dataset, it can be
more efficient to broadcast the lookup table to all the executors rather than sending it over
the network for every join.
Here is a simple example of using broadcast variables in Spark
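The example itself is not included in the notes, so the following is a minimal sketch; the lookup table and order data are illustrative assumptions, and an existing SparkSession named spark is assumed.
lookup = {"IN": "India", "US": "United States", "FR": "France"}   # small lookup table
bc_lookup = spark.sparkContext.broadcast(lookup)                  # shipped once to each executor

orders = spark.sparkContext.parallelize([("o1", "IN"), ("o2", "US"), ("o3", "FR")])

# Each task reads the table locally from bc_lookup.value instead of it being sent for every join.
with_country = orders.map(lambda o: (o[0], bc_lookup.value.get(o[1], "unknown")))
print(with_country.collect())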
Anatomy of a Spark Job Run
The anatomy of a Spark job run involves multiple stages, from defining the job in the
application code to the execution of tasks across a distributed cluster.
At the highest level, there are two independent entities:
Driver: which hosts the application (Spark context) and schedules tasks for a job.
Executors: which are exclusive to the application, run for the duration of the
application, and execute the application’s tasks.
Job Submission:
The following steps illustrate how Spark runs a job.
A Spark job is submitted automatically when an action (such as count()) is performed
on an RDD.
Step 1: Internally, this causes runJob() to be called on the SparkContext, which passes the call
on to the scheduler that runs as part of the driver.
Step 2: The scheduler is made up of two parts: a DAG scheduler that breaks down the job into
a DAG of stages, and a task scheduler that is responsible for submitting the tasks from each
stage to the cluster.
DAG
What is a DAG?
A DAG (Directed Acyclic Graph) is an acyclic graph of stages that Spark builds as soon as you apply transformations to RDDs.
From the DAG, Spark creates a physical plan that shows how the applied transformations are grouped into stages.
DAG Construction:
Tasks come in two types: shuffle map tasks and result tasks.
Shuffle map tasks:
As the name suggests, shuffle map tasks are like the map-side part of the shuffle in
MapReduce.
Each shuffle map task runs a computation on one RDD partition and, based on a partitioning
function, writes its output to a new set of partitions, which are fetched by a later stage.
Shuffle map tasks run in all stages except the final stage.
Result tasks:
Result tasks run in the final stage that returns the result to the user’s program.
How It Works
Spark translates the RDD transformations into a DAG (Directed Acyclic Graph) and then
starts the execution.
In Spark, a job is divided into multiple stages, and each stage consists of a series of
transformations on the data. The transformations in a stage are executed in parallel, and the
stages are executed in a specific order, as determined by the dependencies between the stages.
At high level, when any action is called on the RDD, Spark creates the DAG and submits to
the DAG scheduler.
The DAG scheduler divides each stage into a number of tasks; a stage is comprised of
tasks based on the partitions of the input data. The DAG scheduler also pipelines operators
together: for example, many map operators can be scheduled in a single stage. The final result
of the DAG scheduler is a set of stages.
The stages are passed on to the Task Scheduler. The Task Scheduler launches tasks via the
cluster manager (Spark Standalone/YARN/Mesos). The Task Scheduler doesn’t know about
the dependencies between stages.
The worker nodes execute the tasks.
Task Scheduling
When the task scheduler receives a set of tasks, it uses its list of executors that are
running for the application and constructs a mapping of tasks to executors on the basis of
placement preference.
For a given executor, the scheduler first assigns process-local tasks, then node-local
tasks, then rack-local tasks, before assigning any arbitrary (nonlocal) task.
Executors send status updates to the driver when a task has completed or failed. In case
of task failure, the task scheduler resubmits the task on another executor.
It also launches speculative tasks for tasks that are running slowly, if this feature is
enabled.
Speculative tasks are duplicates of existing tasks, which the scheduler may run as a
backup if a task is running more slowly than expected.
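Speculative execution is switched on through configuration; the following is a minimal sketch, assuming the standard spark.speculation key and an otherwise default setup.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("speculation-demo")
        .set("spark.speculation", "true"))   # re-launch slow-running tasks as backup copies
sc = SparkContext(conf=conf)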
Task Execution
The executor first makes sure that the JAR and file dependencies for the task are up to date.
It keeps a local cache of the dependencies used by previous tasks.
It deserializes the task code (which consists of the user's functions) from the serialized
bytes that were sent as part of the launch task message.
Finally, the task code is executed.
The task can return a result to the driver.
The result is serialized and sent to the executor backend, and then to the driver as a status
update message.
Executors and Cluster Managers
Executors in Spark
An executor is a distributed process launched for a Spark application on a worker node in a
cluster. Executors are responsible for executing tasks and storing data for the Spark
application.
Cluster Managers in Spark
The cluster manager is responsible for managing resources and scheduling tasks across a
cluster.
It provides Spark with access to the computing resources (i.e., CPUs, memory) needed to run
Spark applications.
Spark supports multiple cluster managers, including:
1. Local
2. Standalone
3. Mesos
4. YARN
1. Local
In local mode there is a single executor running in the same JVM as the driver.
This mode is useful for testing or running small jobs.
The master URL for this mode is local (use one thread), local[n] (use n threads), or
local[*] (use one thread per available core).
2. Standalone Cluster Manager
In standalone mode, one of the nodes in the cluster acts as a master, and the rest act
as worker nodes.
The master coordinates the resources and schedules tasks on the workers.
The master URL is spark://host:port.
Suitable for small to medium-sized clusters.
3. Mesos
Apache Mesos is a cluster manager that can dynamically allocate resources across
multiple applications, not just Spark.
It is highly scalable and can handle clusters of thousands of nodes.
The master URL is mesos://host:port.
Spark can run on Mesos in either coarse-grained or fine-grained mode:
o Coarse-grained mode: Allocates a fixed number of resources to Spark for the
entire application runtime.
o Fine-grained mode: Dynamically allocates and deallocates resources during
the runtime, offering more flexibility but with some overhead.
4. YARN (Yet Another Resource Negotiator)
YARN is the resource manager used in Hadoop.
Each running Spark application corresponds to an instance of a YARN
application, and each executor runs in its own YARN container.
The master URL is yarn-client or yarn-cluster.
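The master URL can also be set programmatically when building the SparkSession; a minimal sketch follows (host names and ports are hypothetical, and only one master() line would be active at a time).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("master-url-demo")
         .master("local[*]")               # local mode: one thread per core
         # .master("spark://host:7077")    # standalone cluster manager (hypothetical host)
         # .master("mesos://host:5050")    # Mesos (hypothetical host)
         # .master("yarn-client")          # YARN, as described in these notes
         .getOrCreate())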
Spark on YARN
Running Spark on YARN provides the tightest integration with other Hadoop
components and is the most convenient way to use Spark when you have an existing
Hadoop Cluster.
Spark offers two deploy modes for running on YARN:
YARN Client mode, where the driver runs in the client.
YARN Cluster mode, where the driver runs on the cluster in the YARN application
master.
YARN Modes
1. YARN Client Mode: This mode is required for programs that have any interactive
component, such as spark-shell or pyspark.
Client mode is also useful when building Spark programs, since any debugging output is
immediately visible.
2. YARN Cluster Mode: YARN cluster mode, on the other hand, is appropriate for
production jobs, since the entire application runs on the cluster, which makes it much easier
to retain log files.
YARN client mode
Step 1: In YARN client mode, the interaction with YARN starts when a new SparkContext
instance is constructed by the driver program.
Step 2: The context submits a YARN application to the YARN resource manager.
Step 3: The YARN resource manager starts a YARN container on a node manager in the
cluster and runs a Spark ExecutorLauncher application master in it.
Step 4: The job of the ExecutorLauncher is to start executors in YARN containers, which it
does by requesting resources from the resource manager.
Step 5: The ExecutorLauncher then launches ExecutorBackend processes as the containers are allocated to it.
As each executor starts, it connects back to the SparkContext and registers itself.
This gives the SparkContext information about the number of executors
available for running tasks and their locations, which is used for making task placement
decisions.
The number of executors that are launched is set in spark-shell, spark-submit, or pyspark,
along with the number of cores and the amount of memory that each executor uses.
An example showing how to run spark-shell on YARN with four executors, each using
one core and 2 GB of memory:
% spark-shell --master yarn-client \
--num-executors 4 \
--executor-cores 1 \
--executor-memory 2g
YARN Cluster mode
In YARN cluster mode, the user’s driver program runs in a YARN application master
process.
The spark-submit command is used with a master URL of yarn-cluster:
% spark-submit --master yarn-cluster …
All other parameters, like --num-executors and the application JAR, are the same as for
YARN client mode.
Step 1: The spark-submit client launches the YARN application, but it doesn’t run any user
code.
Step 2: The client submits a YARN application to the YARN resource manager.
Step 3a: The YARN resource manager starts a YARN container on a node manager in the
cluster, and the application master starts the driver program (step 3b) before allocating
resources for executors (step 4).
In both YARN modes, the executors are launched before there is any data locality
information available, so they may not end up being co-located with the datanodes hosting
the files that the jobs access.