Module4 Pig Notes

Apache Pig is a high-level platform for processing large datasets in Hadoop using a scripting language called Pig Latin, which simplifies the complexity of writing MapReduce code. It features an execution environment that supports both local and Hadoop modes, and allows for operations on nested and multivalued data. Pig's architecture includes a parser, optimizer, compiler, and execution engine, enabling efficient data processing without requiring developers to write extensive MapReduce code.

PIG

Background:

What is Apache Pig?

● Apache Pig is a high-level platform for processing large datasets in Hadoop.
● It provides an abstraction over MapReduce and uses a scripting language called Pig Latin.
● Pig was developed by Yahoo! to make data analysis easier for developers and researchers.

Why Use Pig?

● Writing raw MapReduce code is complex and time-consuming.
● Pig allows you to write simple scripts instead of long Java programs.
● It reduces development time significantly: just a few lines of Pig Latin can process terabytes of data.
Components of Pig
Pig Latin:
A data flow scripting language used to write logic.

Execution Environment:
Local Mode – Runs in a single JVM (for small data/testing)
Hadoop Mode – Runs distributed on a Hadoop cluster (for big data)

Features of Pig:
● High-level Abstraction for MapReduce jobs
● Works with nested and multivalued data (like tuples and bags)
● Supports powerful operations:
JOIN, FILTER, GROUP, FOREACH, ORDER BY, etc.
● Allows custom UDFs (User Defined Functions) for flexible
processing
● Pig automatically converts Pig Latin scripts into MapReduce jobs
How Pig Works
A Pig Latin script is a series of operations on input data.
The script defines a data flow.
Pig engine translates it into MapReduce jobs and runs them behind the
scenes.
The developer focuses on what to do with data, not how to do it in
MapReduce.
Example:

-- Load data
A = LOAD 'data.txt' USING PigStorage(',') AS (id:int, name:chararray, salary:int);
-- Filter high salary employees
B = FILTER A BY salary > 50000;
-- Group by name
C = GROUP B BY name;
-- Count how many times each name appears
D = FOREACH C GENERATE group, COUNT(B);
-- Store result
STORE D INTO 'output';
Feature            MapReduce              Apache Pig
Language           Java-based             Scripting (Pig Latin)
Development Time   Long                   Short
Complexity         High                   Easy to write & understand
Data Handling      Structured only        Structured, semi-structured, nested data
Reusability        Low (code-specific)    High (via UDFs)
Pig's Advantages
● Easy to learn for non-Java users

● Faster development cycle compared to MapReduce

● Supports sampling & debugging tools for quick testing (see the sketch after this list)

● Extensible with User Defined Functions (UDFs)
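The sampling and debugging aids mentioned above can be sketched as follows (a minimal example with a hypothetical file and schema, not a complete workflow):

A = LOAD 'data.txt' USING PigStorage(',') AS (id:int, name:chararray);
S = SAMPLE A 0.1;   -- take roughly a 10% random sample for quick testing
DESCRIBE S;         -- print the schema of S
ILLUSTRATE S;       -- trace a few example rows through the data flow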


Pig Architecture:
1. Parser

● The Parser reads Pig scripts written in Pig Latin.


● It performs initial checks like:
○ Syntax validation (Are commands correct?)
○ Type checking (Are data types compatible?)

● Output:
● It creates a Directed Acyclic Graph (DAG):
○ Nodes = Logical operators (e.g., LOAD, FILTER, JOIN)
○ Edges = Flow of data between operations

● Example:
● If a script has 3 steps – Load → Filter → Group –
the parser builds a graph showing this data flow structure.
2. Optimizer
After parsing, the DAG is sent to the Logical Optimizer.
The optimizer improves the plan using techniques like:
✅ Projection Pushdown:
Loads only the necessary columns
Reduces data early to improve performance
✅ Filter Pushdown:
Applies filters closer to the data source
Avoids loading unnecessary rows
Benefit:
Increases query speed and reduces resource usage
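These rewrites can also be mirrored by hand. A minimal sketch (hypothetical file and fields) of filtering and projecting as early as possible, which is what filter and projection pushdown achieve automatically:

raw = LOAD 'sales.csv' USING PigStorage(',') AS (region:chararray, amount:int, notes:chararray);
big = FILTER raw BY amount > 100;              -- filter pushdown: drop unwanted rows early
slim = FOREACH big GENERATE region, amount;    -- projection pushdown: keep only needed columns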

3. Compiler
Converts the optimized logical plan into actual MapReduce jobs
Breaks the Pig script into multiple MapReduce stages automatically
Smart Feature:
Reorders operations if needed to make execution more efficient
Uses data flow understanding to optimize execution
(e.g., rearranging JOINs and FILTERs for better performance)
Advantage:
Programmers don’t need to manually tune or write MapReduce code
Pig handles low-level optimization in the background

4. Execution Engine
The Execution Engine is the final stage in Pig's internal flow.
It receives the MapReduce jobs created by the Compiler.
These jobs are then submitted to the Hadoop system for execution.
What Happens:
Hadoop runs the MapReduce jobs.
Final results are stored in HDFS (or printed in local mode).
The user gets the desired output without writing a single MapReduce
line.

5. Pig Execution Modes


Pig supports two types of execution modes depending on the
environment:
A. Local Mode
● Used for small or sample datasets
● Runs on a single JVM (Java Virtual Machine)
● Uses local file system instead of HDFS
● No need for a Hadoop cluster
● Best For: Testing, learning, and development
● Lightweight scripts and trial runs
● Example: pig -x local script.pig

B. MapReduce Mode (MR Mode)


● Used for large datasets
● Requires a running Hadoop cluster and HDFS
● Pig scripts are automatically converted to MapReduce jobs
● Supports parallel processing
● Best For:
● Large-scale batch processing of big data
● Production environments with distributed systems
● Example: pig -x mapreduce script.pig
Pig Complex Data Types:
Pig supports 3 complex types: Tuple, Bag, and Map.

1. Tuple
● A tuple is an ordered set of fields (like a row in a table).
● Fields can be of any type (simple or complex).
(id:int, name:chararray, city:chararray)
student = (1, 'Rajiv', 'Hyderabad');
1 → int , 'Rajiv' → chararray, 'Hyderabad' → chararray

2. Bag
● A collection of tuples (like a table inside a table).
● Bags are unordered and can have duplicate tuples.
● Represented using {}.
{ (field1, field2), (field1, field2), ... }

students_bag = { (1, 'Rajiv'), (2, 'Siddarth'), (3, 'Rajesh') };


3. Map
● A collection of key–value pairs.
● Keys are always chararray (string).
● Values can be of any type.
● Represented using []
['key1'#value1, 'key2'#value2]
student_map = ['id'#1, 'name'#'Rajiv', 'city'#'Hyderabad'];

Access values using keys:


student_map#'name' -- Returns 'Rajiv'

Pig Latin Basics:

Pig Latin is a scripting language used in Apache Pig for analyzing large
datasets in Hadoop.
It is a data flow language — each statement describes how data should
be loaded, processed, and stored.
✅ Key Features of Pig Latin:
● High-level, easy-to-understand language
● Each line (statement) performs an operation on data
● Operates on relations (like tables)
● Converts Pig Latin scripts into MapReduce jobs in the background

Pig Latin Statements:


Pig Latin programs are made up of statements, which are:
● Basic building blocks of Pig scripts
● Always end with a semicolon (;)
● Work on relations (tables of data)
● Include expressions, functions, schemas, etc.
Types of Pig Statements:
Statement Purpose
LOAD Load data into Pig
STORE Save output to storage (like HDFS)
DUMP Print output to console
FILTER Remove unwanted rows
FOREACH Apply operations on each row
GROUP Group data
JOIN Join two relations
ORDER Sort data
DISTINCT Remove duplicates
Important Behavior:
Most Pig Latin statements take a relation as input and produce a
relation as output.
Exception: LOAD and STORE interact with external files.
Once a LOAD is written, semantic checking happens (type/schema
checking).
The actual data is not loaded until you run DUMP or STORE.

How Execution Works:


After writing the LOAD statement, nothing runs immediately.
When you use DUMP or STORE, the entire data flow is compiled into
MapReduce jobs and executed.
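A minimal sketch of this lazy execution (hypothetical file and schema):

A = LOAD 'data.txt' USING PigStorage(',') AS (id:int, name:chararray);
B = FILTER A BY id > 10;
-- Nothing has run so far; Pig has only built the logical plan.
DUMP B;   -- execution is triggered here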
Pig Execution Modes:
1. Interactive Mode (Grunt Shell)
A command-line shell called Grunt
Allows you to write and run Pig Latin statements one by one
Use:
Best for testing, debugging, and learning
You can DUMP output to see intermediate results
Example:
pig -x local
grunt> A = LOAD 'data.txt' AS (id:int, name:chararray);
grunt> DUMP A;

2. Batch Mode (Script Execution)


Run Pig Latin scripts written in a .pig file
All statements are executed in one go
Use:
Ideal for production, automation, or scheduled jobs
More efficient than interactive mode
Example:
pig -x mapreduce myscript.pig
myscript.pig contains all Pig Latin code

3. Embedded Mode (Using UDFs)


Allows you to embed Pig Latin inside Java code
Supports writing User Defined Functions (UDFs) in Java (or Python)
Use:
Useful when you need custom logic or want to integrate Pig with Java
programs
UDFs allow advanced filtering, grouping, and calculations
Example:
Write a UDF in Java like MyUpperCaseFunction.java
Register and use it in Pig script:
REGISTER 'myudfs.jar';
A = LOAD 'names.txt' AS (name:chararray);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
Pig Processing – Loading and Transforming Data:

Apache Pig runs on top of Hadoop and is used for analyzing large
datasets stored in HDFS (Hadoop Distributed File System).
Before analyzing data, you must load it into Pig using the LOAD
operator.

What is the LOAD Operator?


The LOAD operator is used to read data from a file system (either HDFS
or Local) into Pig.
It is the first step in almost every Pig script.

relation_name = LOAD 'input_file_path'
USING function
AS schema;
Explanation of Syntax Components:

Part                 Description
relation_name        Name of the relation (table-like object) where the data will be stored in memory
'input_file_path'    Path to the file stored in HDFS (in MR mode) or local system (in local mode)
USING function       Function used to load data; Pig provides built-in functions
AS schema            Defines the structure of data (column names and data types)
Common Load Functions in Pig:

Load Function     Description
PigStorage(',')   Loads data using comma as delimiter (used most commonly)
TextLoader        Loads plain text data
JsonLoader        Loads data in JSON format
BinStorage        Loads data stored in binary format (used internally)

Schema syntax:
(column1:datatype, column2:datatype, column3:datatype);
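For illustration, a sketch of loading raw lines with TextLoader (hypothetical file name); each input line becomes a single chararray field:

lines = LOAD 'log.txt' USING TextLoader() AS (line:chararray);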
Important Notes:
LOAD is a logical statement: It does not actually load data until a
command like DUMP or STORE is run.

Pig performs semantic checks when the LOAD statement is written


(e.g., checking schema).

Execution starts only when you use a command like DUMP or STORE.
Component Description
LOAD Reads data from HDFS or local file system

PigStorage Most commonly used loader (e.g., for CSV)

Schema Helps Pig understand data format (types, columns)

DUMP/STORE Triggers actual data loading and processing


Store Operator:
What is the STORE Operator?
The STORE operator in Pig Latin is used to save the output data to a file
system (either HDFS or local file system).
After data is processed in Pig, STORE helps to persist the result.

STORE relation_name INTO 'output_directory_path' [USING function];
Part                       Description
relation_name              The name of the relation (data you want to save)
INTO                       Specifies where to store the data
'output_directory_path'    The HDFS or local path where data will be saved
USING function (optional)  Used to define the storage format (like PigStorage, BinStorage, etc.)
After performing all the data operations (e.g., LOAD, FILTER, GROUP,
etc.), use STORE to write the final result.

Triggers execution: Just like DUMP, the STORE command executes the
data flow and generates the output files.
Example: Storing Output with PigStorage(':')
A = LOAD 'fruit_data.txt' USING PigStorage(',') AS (name:chararray,
fruit:chararray, quantity:int);
STORE A INTO 'out' USING PigStorage(':');

Output (cat out/part-m-00000):


Joe:cherry:2
Ali:apple:3
Joe:banana:2
Eve:apple:7
Output is saved in the folder out and values are separated by colons
(:).
Function          Description
PigStorage(',')   Stores data as plain text with comma as separator
PigStorage(':')   Stores data with colon separator
PigStorage()      Default store function; plain text with tab as separator
BinStorage        Stores data in Pig's binary format (used internally)

The STORE operator saves the processed data to HDFS or local file system.
You can customize the output format using the USING clause.
Unlike DUMP (which displays data on the console), STORE writes data permanently.
You must ensure the output directory does not already exist in HDFS; otherwise Pig will throw an error.
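A common pattern (a sketch; the directory name is assumed) is to delete any previous output from the Grunt shell before re-running, using the embedded fs command:

grunt> fs -rm -r out
grunt> STORE A INTO 'out' USING PigStorage(':');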
Example 1 – Load CSV from HDFS:
employee = LOAD '/user/hadoop/emp.csv' USING PigStorage(',')
AS (id:int, name:chararray, salary:float);
This will load a CSV file with 3 columns into a relation called employee.

Example 2 – Load from Local File System (in local mode):


student = LOAD 'C:/data/student.txt' USING PigStorage('\t')
AS (roll:int, name:chararray, marks:int);
Loads tab-separated data from a local file.
Pig Built-In Functions:
Eval Functions
Eval functions in Pig are built-in functions that help in performing
calculations and data transformations on bags, tuples, or fields.

What is a Bag in Pig?


A Bag is a collection of tuples (similar to rows in a table).

Think of it like a grouped dataset where eval functions are applied for
aggregation or comparison.

1. AVG(col)
Purpose: Computes the average (mean) of numerical values in a
column within a bag.
Usage: Mostly used after GROUP BY. Excludes nulls.
Example:
A = GROUP data BY department;
B = FOREACH A GENERATE group, AVG(data.salary);
2. CONCAT(string1, string2)
Purpose: Joins two string expressions (must be of same type).
Output: A single combined string.
Example:
A = FOREACH data GENERATE CONCAT(first_name, last_name);

3. COUNT(bag)
Purpose: Counts the number of non-null tuples in a bag.
Ignores null values.
Example:
A = GROUP data BY dept; B = FOREACH A GENERATE group,
COUNT(data);

4. COUNT_STAR(bag)
Purpose: Counts all elements including nulls in a bag.
Useful when nulls must be included in analysis.
Example:
A = GROUP data BY dept; B = FOREACH A GENERATE group, COUNT_STAR(data);
5. DIFF(bag1, bag2)
Purpose: Compares two bags and returns elements that are different
(present in one but not the other).
Acts like set difference.
Example:
-- bag1 and bag2 are two bag fields of a relation X
result = FOREACH X GENERATE DIFF(bag1, bag2);

6. IsEmpty(DataBag bag) / IsEmpty(Map map)


Purpose: Checks whether a bag or a map is empty.
Returns: true if the structure is empty, otherwise false.
Example: A = FILTER data BY IsEmpty(data.bag_column);

7. MAX(col)
Purpose: Returns the maximum value in a single-column bag (number
or character). Ignores nulls.
Example:
A = GROUP sales BY region;
B = FOREACH A GENERATE group, MAX(sales.amount);
8. MIN(col)
Purpose: Returns the minimum value in a single-column bag.
Works like MAX but gives the lowest value.
Example: A = GROUP sales BY region;
B = FOREACH A GENERATE group, MIN(sales.amount);
9. DEFINE pluck PluckTuple(expression)
Purpose: Filters and selects columns whose names start with a specific prefix string.
PluckTuple is a built-in UDF that keeps only those columns whose names start with the given prefix.
Helps to dynamically select fields based on a naming pattern.
Example:
DEFINE pluck PluckTuple('emp_'); B = FOREACH A GENERATE pluck(*);
10. SIZE(expression)
Purpose: Returns the number of elements in: a tuple, a bag, a map, a
string (returns length)
Example:
A = FOREACH data GENERATE SIZE(name);              -- Length of string
B = FOREACH group_data GENERATE SIZE(bag_column);  -- Number of tuples
11. SUBTRACT(bag1, bag2)
Purpose: Returns a bag containing elements in bag1 that are not in
bag2.
Works like a set difference.
Example: result = FOREACH X GENERATE SUBTRACT(bagA, bagB);  -- bagA, bagB are bag fields of X

12. SUM(col)
Purpose: Adds all the numerical values in a single-column bag.
Ignores null values.
Example: A = GROUP sales BY region; B = FOREACH A GENERATE group,
SUM(sales.amount);

13. TOKENIZE(string [, delimiter])


Purpose: Splits a string into words or tokens based on a delimiter
(default is space).
Returns a bag of words.
Example:
A = FOREACH data GENERATE TOKENIZE(description) AS words;
Filtering:
The FILTER operator is used to extract only the rows (tuples) from a
relation that match a condition.
Think of it like the WHERE clause in SQL.

Syntax of FILTER:
new_relation = FILTER old_relation BY (condition);

Example Dataset – StudentInfo.txt


1, Rajiv, Reddy, 21, 9848022337, Hyderabad
2, Siddarth, Battacharya, 22, 9848022338, Kolkata
3, Rajesh, Khanna, 22, 9848022339, Delhi
4, Preethi, Agarwal, 21, 9848022330, Pune
5, Trupthi, Mohanthy, 23, 9848022336, Bhubaneswar
6, Archana, Mishra, 23, 9848022335, Chennai
7, Komal, Nayak, 24, 9848022334, Trivandrum
8, Bharathi, Nambiayar, 24, 9848022333, Chennai
Step-by-Step Code Example:
1. Load the file:
student_details =
LOAD 'hdfs://localhost:9000/pig_data/StudentInfo.txt'
USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
This loads the student records and assigns proper schema (columns
with data types).

2. Apply Filter – Students from Chennai:


filter_data = FILTER student_details BY city == 'Chennai';
This extracts only the rows where the city column is equal to 'Chennai'.
Feature Description
Purpose Selects tuples based on a condition
Condition Can use ==, <, >, !=, AND, OR
Data Type Works on int, chararray, float, etc.
Output A new relation with filtered data
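Conditions can be combined, as the table notes. A sketch of a compound filter, reusing the student_details relation loaded above:

filter_data2 = FILTER student_details BY (age >= 22) AND (city == 'Chennai' OR city == 'Delhi');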
Grouping and Sorting:
Grouping in Pig
The GROUP operator is used to group data in a relation based on a
specific field (key).
It is similar to GROUP BY in SQL.
After grouping, you can apply aggregation functions like COUNT, AVG,
SUM, etc.
Syntax:
grouped_data = GROUP relation_name BY field_name;

Example: Group Students by Age


student_details = LOAD 'hdfs://localhost:9000/pig_data/StudentInfo.txt'
USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
grouped_data = GROUP student_details BY age;

This will group all students who have the same age into one group.
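A sketch of applying one of the aggregation functions mentioned above to these groups:

age_counts = FOREACH grouped_data GENERATE group AS age, COUNT(student_details);
DUMP age_counts;   -- e.g., (21,2), (22,2), (23,2), (24,2) for the sample data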
Sorting in Pig

The ORDER BY operator is used to sort the data in ascending (ASC) or descending (DESC) order based on one or more fields.
It is like ORDER BY in SQL.
Syntax: sorted_data = ORDER relation_name BY field_name ASC;
Or
sorted_data = ORDER relation_name BY field_name DESC;
Example: Sort Students by Age
sorted_data = ORDER student_details BY age ASC;
This will sort all student records by age in ascending order.

Example: Sort Students by City in Descending Order


sorted_data = ORDER student_details BY city DESC;
What is Pig Latin?
Pig Latin is the high-level scripting language used in Apache Pig to process and analyze large datasets on Hadoop.
It supports data flow programming and resembles SQL, but it is procedural (step-by-step instructions).
Pig Latin runs on Hadoop and converts scripts into MapReduce jobs internally.
1. LOAD
Loads data from the file system into a relation.
relation = LOAD 'file_path' USING PigStorage(',')
AS (field1:type, field2:type, ...);
Example:
student_details = LOAD
'hdfs://localhost:9000/pig_data/StudentInfo.txt'
USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
2. STORE
Saves the results of a relation into HDFS or local file system.
STORE relation INTO 'output_path' USING PigStorage(',');

3. DUMP
Displays the contents of a relation on the console.
DUMP student_details;

4. FILTER
Filters records based on a condition (like WHERE in SQL).
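Example in the same style, reusing the student_details relation:
adults = FILTER student_details BY age > 21;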
5. FOREACH … GENERATE
Projects (selects) required columns or applies transformations.
names = FOREACH student_details GENERATE firstname, city;

6. GROUP
Groups records by a field (like GROUP BY in SQL).
grouped_data = GROUP student_details BY age;

7. COGROUP
Groups multiple relations by a common field.
COGROUP groups two (or more) relations on a common key.
cogrouped = COGROUP relation1 BY id, relation2 BY id;
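A sketch of what COGROUP produces, assuming two small relations (hypothetical files and fields):

owners = LOAD 'owners.csv' USING PigStorage(',') AS (id:int, owner:chararray);
pets = LOAD 'pets.csv' USING PigStorage(',') AS (id:int, pet:chararray);
cg = COGROUP owners BY id, pets BY id;
-- Each result tuple holds the key plus one bag per input relation:
-- (id, {owners tuples with that id}, {pets tuples with that id})
DUMP cg;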

8. JOIN
Joins two or more relations based on a condition.
joined = JOIN relation1 BY id, relation2 BY id;
9. CROSS
Cartesian product (every row of one relation with every row of
another).
crossed = CROSS relation1, relation2;

10. UNION
Combines data from two or more relations (must have same
schema).
combined = UNION relation1, relation2;

11. DISTINCT
Removes duplicate tuples.
unique_cities = DISTINCT student_details;

12. ORDER
Sorts the relation by one or more fields.
sorted = ORDER student_details BY age ASC;
13. LIMIT
Limits the number of output records.
top5 = LIMIT student_details 5;

14. SAMPLE
Extracts a sample of data.
sample_data = SAMPLE student_details 0.3;
(30% random sample)

15. DESCRIBE
Shows the schema of a relation.
DESCRIBE student_details;

16. ILLUSTRATE
Shows how Pig transforms the data step by step.
ILLUSTRATE student_details;
17. EXPLAIN
Displays the logical, physical, and MapReduce execution plans.
EXPLAIN student_details;
With these commands, you can perform loading, transformation,
filtering, grouping, joining, sorting, and saving results in Pig.
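To tie these together, a minimal end-to-end sketch (hypothetical file, fields, and threshold):

emp = LOAD 'emp.csv' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:int);
high = FILTER emp BY salary > 40000;
bydep = GROUP high BY dept;
stats = FOREACH bydep GENERATE group AS dept, COUNT(high) AS n, AVG(high.salary) AS avg_sal;
top = ORDER stats BY avg_sal DESC;
STORE top INTO 'dept_salary_report' USING PigStorage(',');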
