Apache Pig raises the level of abstraction for processing large datasets.
MapReduce allows us, as
programmers, to specify a map function followed by a reduce function, but working out how
to fit our data processing into this pattern, which often requires multiple MapReduce stages, can
be a challenge. With Pig, the data structures are much richer, typically being multivalued and
nested, and the transformations we can apply to the data are much more powerful.
Pig is made up of two pieces:
The language used to express data flows, called Pig Latin.
The execution environment to run Pig Latin programs. There are currently two environments:
local execution in a single JVM and distributed execution on a Hadoop cluster.
A Pig Latin program is made up of a series of operations, or transformations, that are applied to
the input data to produce output. Taken as a whole, the operations describe a data flow, which
the Pig execution environment translates into an executable representation and then runs. Under
the covers, Pig turns the transformations into a series of MapReduce jobs, but as programmers
we are mostly unaware of this, which allows us to focus on the data rather than the nature of the
execution.
Pig is a scripting language for exploring large datasets. It is very supportive of a programmer
writing a query, since it provides several commands for introspecting the data structures in our
program as it is written. Even more useful, it can perform a sample run on a representative subset
of our input data, so we can see whether there are errors in the processing before unleashing it on
the full dataset.
Installing and Running Pig
Pig runs as a client-side application. Even if we want to run Pig on a Hadoop cluster, there is
nothing extra to install on the cluster: Pig launches jobs and interacts with HDFS (or other
Hadoop filesystems) from our workstation.
Installation is straightforward. Download a stable release from
http://pig.apache.org/releases.html, and unpack the tarball in a suitable place on your
workstation:
% tar xzf pig-x.y.z.tar.gz
It is convenient to add Pig’s binary directory to our command-line path. For example:
% export PIG_HOME=~/sw/pig-x.y.z
% export PATH=$PATH:$PIG_HOME/bin
You also need to set the JAVA_HOME environment variable to point to a suitable Java
installation.
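For example, on a Linux workstation with OpenJDK installed (the exact path here is just an illustration; point it at your own Java installation):
% export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64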
Try typing pig -help to get usage instructions.
Execution Types
Pig has two execution types or modes: local mode and MapReduce mode.
Local mode
In local mode, Pig runs in a single JVM and accesses the local file system. This mode is suitable
only for small datasets and when trying out Pig.
The execution type is set using the -x or -exectype option. To run in local mode, set the option to
local:
% pig -x local
grunt>
This starts Grunt, the Pig interactive shell, which is discussed in more detail shortly.
MapReduce mode
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop
cluster. The cluster may be a pseudo- or fully distributed cluster. MapReduce mode (with a fully
distributed cluster) is what we use when we want to run Pig on large datasets.
To use MapReduce mode, we first need to check that the version of Pig we downloaded is
compatible with the version of Hadoop we are using. Pig releases will only work against
particular versions of Hadoop; this is documented in the release notes.
You also need to set the HADOOP_HOME environment variable so that Pig knows which Hadoop
client to run.
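For example, mirroring the PIG_HOME setting shown earlier (the exact path is just an illustration):
% export HADOOP_HOME=~/sw/hadoop-x.y.z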
Next, we need to point Pig at the cluster’s namenode and resource manager by setting properties
in the pig.properties file in Pig’s conf directory (or the directory specified by PIG_CONF_DIR).
Here’s an example for a pseudodistributed setup:
fs.defaultFS=hdfs://localhost/
mapreduce.framework.name=yarn
yarn.resourcemanager.address=localhost:8032
Once we have configured Pig to connect to a Hadoop cluster, we can launch Pig, setting the -x
option to mapreduce or omitting it entirely, as MapReduce mode is the default.
We’ve used the -brief option to stop timestamps from being logged:
% pig -brief
Logging error messages to: /Users/tom/pig_1414246949680.log
Default bootup file /Users/tom/.pigbootup not found
Connecting to hadoop file system at: hdfs://localhost/
grunt>
As we can see from the output, Pig reports the filesystem (but not the YARN resource manager)
that it has connected to.
Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and MapReduce
mode:
Script
Pig can run a script file that contains Pig commands. For example, pig script.pig runs the
commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e
option to run a script specified as a string on the command line.
Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified
for Pig to run and the -e option is not used. It is also possible to run Pig scripts from within Grunt
using run and exec.
Embedded
You can run Pig programs from Java using the PigServer class, much like you can use JDBC to
run SQL programs from Java. For programmatic access to Grunt, use PigRunner.
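As a brief sketch of the first two methods (assuming the sample weather file used later in this chapter and a script called script.pig exist on the local filesystem), we could run a short script given as a string with the -e option, or run a script file from within Grunt with exec:
% pig -x local -e "A = LOAD 'input/ncdc/micro-tab/sample.txt'; DUMP A;"
% pig -x local
grunt> exec script.pig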
Grunt
Grunt has line-editing facilities; for instance, the Ctrl-E key combination will move the cursor to
the end of the line. Grunt remembers command history, too, and we can recall lines in the history
buffer using Ctrl-P or Ctrl-N (for previous and next), or equivalently, the up or down cursor
keys.
Another handy feature is Grunt’s completion mechanism, which will try to complete Pig Latin
keywords and functions when you press the Tab key. For example, consider the following
incomplete line:
grunt> a = foreach b ge
If we press the Tab key at this point, ge will expand to generate, a Pig Latin keyword:
grunt> a = foreach b generate
Pig Latin Editors
There are Pig Latin syntax highlighters available for a variety of editors, including Eclipse,
IntelliJ IDEA, Vim, Emacs, and TextMate. Details are available on the Pig wiki. Many Hadoop
distributions come with the Hue web interface, which has a Pig script editor and launcher.
An Example
Let’s look at a simple example by writing the program to calculate the maximum recorded
temperature by year for the weather dataset in Pig Latin (just like we did using MapReduce). The
complete program is only a few lines long:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
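Assuming the script is saved as max_temp.pig in the current directory, it can be run in its entirety (in local mode, say) with:
% pig -x local max_temp.pig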
To explore what’s going on, we’ll use Pig’s Grunt interpreter, which allows us to enter lines and
interact with the program to understand what it’s doing. Start up Grunt in local mode, and then
enter the first line of the Pig script:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);
For simplicity, the program assumes that the input is tab-delimited text, with each line having
just year, temperature, and quality fields. The year:chararray notation describes the field’s name
and type; chararray is like a Java String, and an int is like a Java int. The LOAD operator takes a
URI argument; here we are just using a local file, but we could refer to an HDFS URI. The AS
clause (which is optional) gives the fields names to make it convenient to refer to them in
subsequent statements.
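For instance, if the sample file had been copied into HDFS, the same statement could refer to it with an HDFS URI (a sketch; the path assumes the file sits under the user’s HDFS home directory):
grunt> records = LOAD 'hdfs://localhost/user/tom/input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);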
The result of the LOAD operator, and indeed any operator in Pig Latin, is a relation, which is
just a set of tuples. A tuple is just like a row of data in a database table, with multiple fields in a
particular order. In this example, the LOAD function produces a set of (year, temperature,
quality) tuples that are present in the input file. We write a relation with one tuple per line, where
tuples are represented as comma-separated items in parentheses:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
Relations are given names, or aliases, so they can be referred to. This relation is given the
records alias. We can examine the contents of an alias using the DUMP operator:
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
We can also see the structure of a relation—the relation’s schema—using the DESCRIBE
operator on the relation’s alias:
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
This tells us that records has three fields, with aliases year, temperature, and quality, which are
the names we gave them in the AS clause. The fields have the types given to them in the AS
clause, too.
The second statement removes records that have a missing temperature (indicated by a value of
9999) or an unsatisfactory quality reading. For this small dataset, no records are filtered out:
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> quality IN (0, 1, 4, 5, 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
The third statement uses the GROUP function to group the filtered_records relation by the year field.
Let’s use DUMP to see what it produces:
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,78,1),(1949,111,1)})
(1950,{(1950,-11,1),(1950,22,1),(1950,0,1)})
We now have two rows, or tuples: one for each year in the input data. The first field in each tuple
is the field being grouped by (the year), and the second field has a bag of tuples for that year. A
bag is just an unordered collection of tuples, which in Pig Latin is represented using curly braces.
By grouping the data in this way, we have created a row per year, so now all that remains is to
find the maximum temperature for the tuples in each bag. Before we do this, let’s understand the
structure of the grouped_records relation:
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray,filtered_records: {year: chararray,
temperature: int,quality: int}}
This tells us that the grouping field is given the alias group by Pig, and the second field is the
same structure as the filtered_records relation that was being grouped. With this information, we
can try the fourth transformation:
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to
define the fields in each derived row. In this example, the first field is group, which is just the
year.
The second field is a little more complex: the filtered_records.temperature reference is to the
temperature field of the filtered_records bag
in the grouped_records relation. MAX is a built-in function for calculating the maximum value
of fields in a bag. In this case, it calculates the maximum temperature for the fields in each
filtered_records bag. Let’s check the result:
grunt> DUMP max_temp;
(1949,111)
(1950,22)
We’ve successfully calculated the maximum temperature for each year.
Generating Examples
In this example, we’ve used a small sample dataset with just a handful of rows to make
it easier to follow the data flow and aid debugging.
With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably complete
and concise sample dataset. For example:
grunt> ILLUSTRATE max_temp;
In its output, Pig uses some of the original data (this is important to keep the generated dataset
realistic), as well as creating some new data. It notices the special value 9999 in the query and
creates a tuple containing this value to exercise the FILTER statement.