Module4 Pig Notes

Apache Pig is a high-level platform for processing large datasets in Hadoop using a scripting language called Pig Latin, which simplifies the complexity of writing MapReduce code. It features an execution environment that supports both local and Hadoop modes, and allows for operations on nested and multivalued data. Pig's architecture includes a parser, optimizer, compiler, and execution engine, enabling efficient data processing without requiring developers to write extensive MapReduce code.

PIG

Background:

What is Apache Pig?

● Apache Pig is a high-level platform for processing large datasets in Hadoop.
● It provides an abstraction over MapReduce and uses a scripting language called Pig Latin.
● Pig was developed by Yahoo! to make data analysis easier for developers and researchers.

Why Use Pig?

● Writing raw MapReduce code is complex and time-consuming.
● Pig allows you to write simple scripts instead of long Java programs.
● It reduces development time significantly: just a few lines of Pig Latin can process terabytes of data.
Components of Pig
Pig Latin:
A data flow scripting language used to write logic.

Execution Environment:
Local Mode – Runs in a single JVM (for small data/testing)
Hadoop Mode – Runs distributed on a Hadoop cluster (for big data)

Features of Pig:
● High-level Abstraction for MapReduce jobs
● Works with nested and multivalued data (like tuples and bags)
● Supports powerful operations:
JOIN, FILTER, GROUP, FOREACH, ORDER BY, etc.
● Allows custom UDFs (User Defined Functions) for flexible
processing
● Pig automatically converts Pig Latin scripts into MapReduce jobs
How Pig Works
A Pig Latin script is a series of operations on input data.
The script defines a data flow.
Pig engine translates it into MapReduce jobs and runs them behind the
scenes.
The developer focuses on what to do with data, not how to do it in
MapReduce.
Example:

-- Load data
A = LOAD 'data.txt' USING PigStorage(',') AS (id:int, name:chararray, salary:int);
-- Filter high salary employees
B = FILTER A BY salary > 50000;
-- Group by name
C = GROUP B BY name;
-- Count how many times each name appears
D = FOREACH C GENERATE group, COUNT(B);
-- Store result
STORE D INTO 'output';
Feature            MapReduce              Apache Pig
Language           Java-based             Scripting (Pig Latin)
Development Time   Long                   Short
Complexity         High                   Easy to write & understand
Data Handling      Structured only        Structured, semi-structured, nested data
Reusability        Low (code-specific)    High (via UDFs)
Pig's Advantages
● Easy to learn for non-Java users

● Faster development cycle compared to MapReduce

● Supports sampling & debugging tools for quick testing (see the sketch after this list)

● Extensible with User Defined Functions (UDFs)
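The sampling and debugging aids mentioned above can be sketched as follows (a minimal example with a hypothetical file and schema, not a complete workflow):

A = LOAD 'data.txt' USING PigStorage(',') AS (id:int, name:chararray);
S = SAMPLE A 0.1;   -- take roughly a 10% random sample for quick testing
DESCRIBE S;         -- print the schema of S
ILLUSTRATE S;       -- trace a few example rows through the data flow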


Pig Architecture:
1. Parser

● The Parser reads Pig scripts written in Pig Latin.


● It performs initial checks like:
○ Syntax validation (Are commands correct?)
○ Type checking (Are data types compatible?)

● Output:
● It creates a Directed Acyclic Graph (DAG):
○ Nodes = Logical operators (e.g., LOAD, FILTER, JOIN)
○ Edges = Flow of data between operations

● Example:
● If a script has 3 steps – Load → Filter → Group –
the parser builds a graph showing this data flow structure.
2. Optimizer
After parsing, the DAG is sent to the Logical Optimizer.
The optimizer improves the plan using techniques like:
✅ Projection Pushdown:
Loads only the necessary columns
Reduces data early to improve performance
✅ Filter Pushdown:
Applies filters closer to the data source
Avoids loading unnecessary rows
Benefit:
Increases query speed and reduces resource usage
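These rewrites can also be mirrored by hand. A minimal sketch (hypothetical file and fields) of filtering and projecting as early as possible, which is what filter and projection pushdown achieve automatically:

raw = LOAD 'sales.csv' USING PigStorage(',') AS (region:chararray, amount:int, notes:chararray);
big = FILTER raw BY amount > 100;              -- filter pushdown: drop unwanted rows early
slim = FOREACH big GENERATE region, amount;    -- projection pushdown: keep only needed columns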

3. Compiler
Converts the optimized logical plan into actual MapReduce jobs
Breaks the Pig script into multiple MapReduce stages automatically
Smart Feature:
Reorders operations if needed to make execution more efficient
Uses data flow understanding to optimize execution
(e.g., rearranging JOINs and FILTERs for better performance)
Advantage:
Programmers don’t need to manually tune or write MapReduce code
Pig handles low-level optimization in the background

4. Execution Engine
The Execution Engine is the final stage in Pig's internal flow.
It receives the MapReduce jobs created by the Compiler.
These jobs are then submitted to the Hadoop system for execution.
What Happens:
Hadoop runs the MapReduce jobs.
Final results are stored in HDFS (or printed in local mode).
The user gets the desired output without writing a single MapReduce
line.

5. Pig Execution Modes


Pig supports two types of execution modes depending on the
environment:
A. Local Mode
● Used for small or sample datasets
● Runs on a single JVM (Java Virtual Machine)
● Uses local file system instead of HDFS
● No need for a Hadoop cluster
● Best For: Testing, learning, and development
● Lightweight scripts and trial runs
● Example: pig -x local script.pig

B. MapReduce Mode (MR Mode)


● Used for large datasets
● Requires a running Hadoop cluster and HDFS
● Pig scripts are automatically converted to MapReduce jobs
● Supports parallel processing
● Best For:
● Large-scale batch processing of big data
● Production environments with distributed systems
● Example: pig -x mapreduce script.pig
Pig Complex Data Types:
Pig supports 3 complex types: Tuple, Bag, and Map.

1. Tuple
● A tuple is an ordered set of fields (like a row in a table).
● Fields can be of any type (simple or complex).
(id:int, name:chararray, city:chararray)
student = (1, 'Rajiv', 'Hyderabad');
1 → int , 'Rajiv' → chararray, 'Hyderabad' → chararray

2. Bag
● A collection of tuples (like a table inside a table).
● Bags are unordered and can have duplicate tuples.
● Represented using {}.
{ (field1, field2), (field1, field2), ... }

students_bag = { (1, 'Rajiv'), (2, 'Siddarth'), (3, 'Rajesh') };


3. Map
● A collection of key–value pairs.
● Keys are always chararray (string).
● Values can be of any type.
● Represented using []
['key1'#value1, 'key2'#value2]
student_map = ['id'#1, 'name'#'Rajiv', 'city'#'Hyderabad'];

Access values using keys:


student_map#'name' -- Returns 'Rajiv'

Pig Latin Basics:

Pig Latin is a scripting language used in Apache Pig for analyzing large
datasets in Hadoop.
It is a data flow language — each statement describes how data should
be loaded, processed, and stored.
✅ Key Features of Pig Latin:
● High-level, easy-to-understand language
● Each line (statement) performs an operation on data
● Operates on relations (like tables)
● Converts Pig Latin scripts into MapReduce jobs in the background

Pig Latin Statements:


Pig Latin programs are made up of statements, which are:
● Basic building blocks of Pig scripts
● Always end with a semicolon (;)
● Work on relations (tables of data)
● Include expressions, functions, schemas, etc.
Types of Pig Statements:
Statement Purpose
LOAD Load data into Pig
STORE Save output to storage (like HDFS)
DUMP Print output to console
FILTER Remove unwanted rows
FOREACH Apply operations on each row
GROUP Group data
JOIN Join two relations
ORDER Sort data
DISTINCT Remove duplicates
Important Behavior:
Most Pig Latin statements take a relation as input and produce a
relation as output.
Exception: LOAD and STORE interact with external files.
Once a LOAD is written, semantic checking happens (type/schema
checking).
The actual data is not loaded until you run DUMP or STORE.

How Execution Works:


After writing the LOAD statement, nothing runs immediately.
When you use DUMP or STORE, the entire data flow is compiled into
MapReduce jobs and executed.
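A minimal sketch of this lazy execution (hypothetical file and schema):

A = LOAD 'data.txt' USING PigStorage(',') AS (id:int, name:chararray);
B = FILTER A BY id > 10;
-- Nothing has run so far; Pig has only built the logical plan.
DUMP B;   -- execution is triggered here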
Pig Execution Modes:
1. Interactive Mode (Grunt Shell)
A command-line shell called Grunt
Allows you to write and run Pig Latin statements one by one
Use:
Best for testing, debugging, and learning
You can DUMP output to see intermediate results
Example:
pig -x local
grunt> A = LOAD 'data.txt' AS (id:int, name:chararray);
grunt> DUMP A;

2. Batch Mode (Script Execution)


Run Pig Latin scripts written in a .pig file
All statements are executed in one go
Use:
Ideal for production, automation, or scheduled jobs
More efficient than interactive mode
Example:
pig -x mapreduce myscript.pig
myscript.pig contains all Pig Latin code

3. Embedded Mode (Using UDFs)


Allows you to embed Pig Latin inside Java code
Supports writing User Defined Functions (UDFs) in Java (or Python)
Use:
Useful when you need custom logic or want to integrate Pig with Java
programs
UDFs allow advanced filtering, grouping, and calculations
Example:
Write a UDF in Java like MyUpperCaseFunction.java
Register and use it in Pig script:
REGISTER 'myudfs.jar';
A = LOAD 'names.txt' AS (name:chararray);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
Pig Processing – Loading and Transforming Data:

Apache Pig runs on top of Hadoop and is used for analyzing large
datasets stored in HDFS (Hadoop Distributed File System).
Before analyzing data, you must load it into Pig using the LOAD
operator.

What is the LOAD Operator?


The LOAD operator is used to read data from a file system (either HDFS
or Local) into Pig.
It is the first step in almost every Pig script.

relation_name = LOAD 'input_file_path'
USING function
AS schema;
Explanation of Syntax Components:

Part                 Description
relation_name        Name of the relation (table-like object) where the data will be stored in memory
'input_file_path'    Path to the file stored in HDFS (in MR mode) or local system (in local mode)
USING function       Function used to load data; Pig provides built-in functions
AS schema            Defines the structure of data (column names and data types)
Common Load Functions in Pig:

Load Function     Description
PigStorage(',')   Loads data using comma as delimiter (used most commonly)
TextLoader        Loads plain text data
JsonLoader        Loads data in JSON format
BinStorage        Loads data stored in binary format (used internally)

Schema syntax:
(column1:datatype, column2:datatype, column3:datatype);
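For illustration, a sketch of loading raw lines with TextLoader (hypothetical file name); each input line becomes a single chararray field:

lines = LOAD 'log.txt' USING TextLoader() AS (line:chararray);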
Important Notes:
LOAD is a logical statement: It does not actually load data until a
command like DUMP or STORE is run.

Pig performs semantic checks when the LOAD statement is written


(e.g., checking schema).

Execution starts only when you use a command like DUMP or STORE.
Component Description
LOAD Reads data from HDFS or local file system

PigStorage Most commonly used loader (e.g., for CSV)

Schema Helps Pig understand data format (types, columns)

DUMP/STORE Triggers actual data loading and processing


Store Operator:
What is the STORE Operator?
The STORE operator in Pig Latin is used to save the output data to a file
system (either HDFS or local file system).
After data is processed in Pig, STORE helps to persist the result.

STORE relation_name INTO 'output_directory_path' [USING function];
Part                       Description
relation_name              The name of the relation (data you want to save)
INTO                       Specifies where to store the data
'output_directory_path'    The HDFS or local path where data will be saved
USING function (optional)  Used to define the storage format (like PigStorage, BinStorage, etc.)
After performing all the data operations (e.g., LOAD, FILTER, GROUP,
etc.), use STORE to write the final result.

Triggers execution: Just like DUMP, the STORE command executes the
data flow and generates the output files.
Example: Storing Output with PigStorage(':')
A = LOAD 'fruit_data.txt' USING PigStorage(',') AS (name:chararray,
fruit:chararray, quantity:int);
STORE A INTO 'out' USING PigStorage(':');

Output (cat out/part-m-00000):


Joe:cherry:2
Ali:apple:3
Joe:banana:2
Eve:apple:7
Output is saved in the folder out and values are separated by colons
(:).
Function          Description
PigStorage(',')   Stores data as plain text with comma as separator
PigStorage(':')   Stores data with colon separator
PigStorage()      Default store function; plain text with tab as separator
BinStorage        Stores data in Pig's binary format (used internally)

The STORE operator saves the processed data to HDFS or local file system.
You can customize the output format using the USING clause.
Unlike DUMP (which displays data on the console), STORE writes data permanently.
You must ensure the output directory does not already exist in HDFS; otherwise Pig will throw an error.
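A common pattern (a sketch; the directory name is assumed) is to delete any previous output from the Grunt shell before re-running, using the embedded fs command:

grunt> fs -rm -r out
grunt> STORE A INTO 'out' USING PigStorage(':');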
Example 1 – Load CSV from HDFS:
employee = LOAD '/user/hadoop/emp.csv' USING PigStorage(',')
AS (id:int, name:chararray, salary:float);
This will load a CSV file with 3 columns into a relation called employee.

Example 2 – Load from Local File System (in local mode):


student = LOAD 'C:/data/student.txt' USING PigStorage('\t')
AS (roll:int, name:chararray, marks:int);
Loads tab-separated data from a local file.
Pig Built-In Functions:
Eval Functions
Eval functions in Pig are built-in functions that help in performing
calculations and data transformations on bags, tuples, or fields.

What is a Bag in Pig?


A Bag is a collection of tuples (similar to rows in a table).

Think of it like a grouped dataset where eval functions are applied for
aggregation or comparison.

1. AVG(col)
Purpose: Computes the average (mean) of numerical values in a
column within a bag.
Usage: Mostly used after GROUP BY. Excludes nulls.
Example:
A = GROUP data BY department;
B = FOREACH A GENERATE group, AVG(data.salary);
2. CONCAT(string1, string2)
Purpose: Joins two string expressions (must be of same type).
Output: A single combined string.
Example:
A = FOREACH data GENERATE CONCAT(first_name, last_name);

3. COUNT(bag)
Purpose: Counts the number of non-null tuples in a bag.
Ignores null values.
Example:
A = GROUP data BY dept; B = FOREACH A GENERATE group,
COUNT(data);

4. COUNT_STAR(bag)
Purpose: Counts all elements including nulls in a bag.
Useful when nulls must be included in analysis.
Example:
A = GROUP data BY dept; B = FOREACH A GENERATE group, COUNT_STAR(data);
5. DIFF(bag1, bag2)
Purpose: Compares two bags and returns elements that are different
(present in one but not the other).
Acts like set difference.
Example:
-- bag1 and bag2 are two bag fields of a relation X
result = FOREACH X GENERATE DIFF(bag1, bag2);

6. IsEmpty(DataBag bag) / IsEmpty(Map map)


Purpose: Checks whether a bag or a map is empty.
Returns: true if the structure is empty, otherwise false.
Example: A = FILTER data BY IsEmpty(data.bag_column);

7. MAX(col)
Purpose: Returns the maximum value in a single-column bag (number
or character). Ignores nulls.
Example:
A = GROUP sales BY region;
B = FOREACH A GENERATE group, MAX(sales.amount);
8. MIN(col)
Purpose: Returns the minimum value in a single-column bag.
Works like MAX but gives the lowest value.
Example: A = GROUP sales BY region;
B = FOREACH A GENERATE group, MIN(sales.amount);
9. DEFINE pluck PluckTuple(expression)
Purpose: Filters and selects columns whose names start with a specific prefix string.
PluckTuple is a built-in UDF that keeps only those columns whose names start with the given prefix.
Helps to dynamically select fields based on a naming pattern.
Example:
DEFINE pluck PluckTuple('emp_'); B = FOREACH A GENERATE pluck(*);
10. SIZE(expression)
Purpose: Returns the number of elements in: a tuple, a bag, a map, a
string (returns length)
Example:
A = FOREACH data GENERATE SIZE(name);              -- Length of string
B = FOREACH group_data GENERATE SIZE(bag_column);  -- Number of tuples
11. SUBTRACT(bag1, bag2)
Purpose: Returns a bag containing elements in bag1 that are not in
bag2.
Works like a set difference.
Example: result = FOREACH X GENERATE SUBTRACT(bagA, bagB);  -- bagA, bagB are bag fields of X

12. SUM(col)
Purpose: Adds all the numerical values in a single-column bag.
Ignores null values.
Example: A = GROUP sales BY region; B = FOREACH A GENERATE group,
SUM(sales.amount);

13. TOKENIZE(string [, delimiter])


Purpose: Splits a string into words or tokens based on a delimiter
(default is space).
Returns a bag of words.
Example:
A = FOREACH data GENERATE TOKENIZE(description) AS words;
Filtering:
The FILTER operator is used to extract only the rows (tuples) from a
relation that match a condition.
Think of it like the WHERE clause in SQL.

Syntax of FILTER:
new_relation = FILTER old_relation BY (condition);

Example Dataset – StudentInfo.txt


1, Rajiv, Reddy, 21, 9848022337, Hyderabad
2, Siddarth, Battacharya, 22, 9848022338, Kolkata
3, Rajesh, Khanna, 22, 9848022339, Delhi
4, Preethi, Agarwal, 21, 9848022330, Pune
5, Trupthi, Mohanthy, 23, 9848022336, Bhubaneswar
6, Archana, Mishra, 23, 9848022335, Chennai
7, Komal, Nayak, 24, 9848022334, Trivandrum
8, Bharathi, Nambiayar, 24, 9848022333, Chennai
Step-by-Step Code Example:
1. Load the file:
student_details =
LOAD 'hdfs://localhost:9000/pig_data/StudentInfo.txt'
USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
This loads the student records and assigns proper schema (columns
with data types).

2. Apply Filter – Students from Chennai:


filter_data = FILTER student_details BY city == 'Chennai';
This extracts only the rows where the city column is equal to 'Chennai'.
Feature Description
Purpose Selects tuples based on a condition
Condition Can use ==, <, >, !=, AND, OR
Data Type Works on int, chararray, float, etc.
Output A new relation with filtered data
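Conditions can be combined, as the table notes. A sketch of a compound filter, reusing the student_details relation loaded above:

filter_data2 = FILTER student_details BY (age >= 22) AND (city == 'Chennai' OR city == 'Delhi');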
Grouping and Sorting:
Grouping in Pig
The GROUP operator is used to group data in a relation based on a
specific field (key).
It is similar to GROUP BY in SQL.
After grouping, you can apply aggregation functions like COUNT, AVG,
SUM, etc.
Syntax:
grouped_data = GROUP relation_name BY field_name;

Example: Group Students by Age


student_details = LOAD 'hdfs://localhost:9000/pig_data/StudentInfo.txt'
USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
grouped_data = GROUP student_details BY age;

This will group all students who have the same age into one group.
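A sketch of applying one of the aggregation functions mentioned above to these groups:

age_counts = FOREACH grouped_data GENERATE group AS age, COUNT(student_details);
DUMP age_counts;   -- e.g., (21,2), (22,2), (23,2), (24,2) for the sample data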
Sorting in Pig

The ORDER BY operator is used to sort the data in ascending (ASC) or descending (DESC) order based on one or more fields.
It is like ORDER BY in SQL.
Syntax: sorted_data = ORDER relation_name BY field_name ASC;
Or
sorted_data = ORDER relation_name BY field_name DESC;
Example: Sort Students by Age
sorted_data = ORDER student_details BY age ASC;
This will sort all student records by age in ascending order.

Example: Sort Students by City in Descending Order


sorted_data = ORDER student_details BY city DESC;
What is Pig Latin?
Pig Latin is the high-level scripting language used in Apache Pig to process and analyze large datasets on Hadoop.
It supports data flow programming and resembles SQL, but it is procedural (step-by-step instructions).
Pig Latin runs on Hadoop and converts scripts into MapReduce jobs internally.
1. LOAD
Loads data from the file system into a relation.
relation = LOAD 'file_path' USING PigStorage(',')
AS (field1:type, field2:type, ...);
Example:
student_details = LOAD
'hdfs://localhost:9000/pig_data/StudentInfo.txt'
USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
2. STORE
Saves the results of a relation into HDFS or local file system.
STORE relation INTO 'output_path' USING PigStorage(',');

3. DUMP
Displays the contents of a relation on the console.
DUMP student_details;

4. FILTER
Filters records based on a condition (like WHERE in SQL).
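Example in the same style, reusing the student_details relation:
adults = FILTER student_details BY age > 21;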
5. FOREACH … GENERATE
Projects (selects) required columns or applies transformations.
names = FOREACH student_details GENERATE firstname, city;

6. GROUP
Groups records by a field (like GROUP BY in SQL).
grouped_data = GROUP student_details BY age;

7. COGROUP
Groups multiple relations by a common field.
COGROUP groups two (or more) relations on a common key.
cogrouped = COGROUP relation1 BY id, relation2 BY id;
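A sketch of what COGROUP produces, assuming two small relations (hypothetical files and fields):

owners = LOAD 'owners.csv' USING PigStorage(',') AS (id:int, owner:chararray);
pets = LOAD 'pets.csv' USING PigStorage(',') AS (id:int, pet:chararray);
cg = COGROUP owners BY id, pets BY id;
-- Each result tuple holds the key plus one bag per input relation:
-- (id, {owners tuples with that id}, {pets tuples with that id})
DUMP cg;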

8. JOIN
Joins two or more relations based on a condition.
joined = JOIN relation1 BY id, relation2 BY id;
9. CROSS
Cartesian product (every row of one relation with every row of
another).
crossed = CROSS relation1, relation2;

10. UNION
Combines data from two or more relations (must have same
schema).
combined = UNION relation1, relation2;

11. DISTINCT
Removes duplicate tuples.
unique_cities = DISTINCT student_details;

12. ORDER
Sorts the relation by one or more fields.
sorted = ORDER student_details BY age ASC;
13. LIMIT
Limits the number of output records.
top5 = LIMIT student_details 5;

14. SAMPLE
Extracts a sample of data.
sample_data = SAMPLE student_details 0.3;
(30% random sample)

15. DESCRIBE
Shows the schema of a relation.
DESCRIBE student_details;

16. ILLUSTRATE
Shows how Pig transforms the data step by step.
ILLUSTRATE student_details;
17. EXPLAIN
Displays the logical, physical, and MapReduce execution plans.
EXPLAIN student_details;
With these commands, you can perform loading, transformation,
filtering, grouping, joining, sorting, and saving results in Pig.
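To tie these together, a minimal end-to-end sketch (hypothetical file, fields, and threshold):

emp = LOAD 'emp.csv' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:int);
high = FILTER emp BY salary > 40000;
bydep = GROUP high BY dept;
stats = FOREACH bydep GENERATE group AS dept, COUNT(high) AS n, AVG(high.salary) AS avg_sal;
top = ORDER stats BY avg_sal DESC;
STORE top INTO 'dept_salary_report' USING PigStorage(',');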
