Apache Pig raises the level of abstraction for processing large datasets.
MapReduce allows us, as
programmers, to specify a map function followed by a reduce function, but working out how
to fit our data processing into this pattern, which often requires multiple MapReduce stages, can
be a challenge. With Pig, the data structures are much richer, typically being multivalued and
nested, and the transformations we can apply to the data are much more powerful.
Pig is made up of two pieces:
The language used to express data flows, called Pig Latin.
The execution environment to run Pig Latin programs. There are currently two environments:
local execution in a single JVM and distributed execution on a Hadoop cluster.
A Pig Latin program is made up of a series of operations, or transformations, that are applied to
the input data to produce output. Taken as a whole, the operations describe a data flow, which
the Pig execution environment translates into an executable representation and then runs. Under
the covers, Pig turns the transformations into a series of MapReduce jobs, but as programmers
we are mostly unaware of this, which allows us to focus on the data rather than the nature of the
execution.
Pig is a scripting language for exploring large datasets. It is very supportive of a programmer
writing a query, since it provides several commands for introspecting the data structures in our
program as it is written. Even more useful, it can perform a sample run on a representative subset
of our input data, so we can see whether there are errors in the processing before unleashing it on
the full dataset.
Installing and Running Pig
Pig runs as a client-side application. Even if we want to run Pig on a Hadoop cluster, there is
nothing extra to install on the cluster: Pig launches jobs and interacts with HDFS (or other
Hadoop filesystems) from our workstation.
Installation is straightforward. Download a stable release from
http://pig.apache.org/releases.html, and unpack the tarball in a suitable place on your
workstation:
% tar xzf pig-x.y.z.tar.gz
It is convenient to add Pig’s binary directory to our command-line path. For example:
% export PIG_HOME=~/sw/pig-x.y.z
% export PATH=$PATH:$PIG_HOME/bin
You also need to set the JAVA_HOME environment variable to point to a suitable Java
installation.
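For example, on a Linux workstation with OpenJDK installed (the exact path here is just an illustration; point it at your own Java installation):
% export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64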
Try typing pig -help to get usage instructions.
Execution Types
Pig has two execution types or modes: local mode and MapReduce mode.
Local mode
In local mode, Pig runs in a single JVM and accesses the local file system. This mode is suitable
only for small datasets and when trying out Pig.
The execution type is set using the -x or -exectype option. To run in local mode, set the option to
local:
% pig -x local
grunt>
This starts Grunt, the Pig interactive shell, which is discussed in more detail shortly.
MapReduce mode
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop
cluster. The cluster may be a pseudo- or fully distributed cluster. MapReduce mode (with a fully
distributed cluster) is what we use when we want to run Pig on large datasets.
To use MapReduce mode, we first need to check that the version of Pig we downloaded is
compatible with the version of Hadoop we are using. Pig releases will only work against
particular versions of Hadoop; this is documented in the release notes.
You also need to set the HADOOP_HOME environment variable so that Pig knows which Hadoop
client to run.
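For example, mirroring the PIG_HOME setting shown earlier (the exact path is just an illustration):
% export HADOOP_HOME=~/sw/hadoop-x.y.z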
Next, we need to point Pig at the cluster’s namenode and resource manager by setting properties
in the pig.properties file in Pig’s conf directory (or the directory specified by PIG_CONF_DIR).
Here’s an example for a pseudodistributed setup:
fs.defaultFS=hdfs://localhost/
mapreduce.framework.name=yarn
yarn.resourcemanager.address=localhost:8032
Once we have configured Pig to connect to a Hadoop cluster, we can launch Pig, setting the -x
option to mapreduce or omitting it entirely, as MapReduce mode is the default.
We’ve used the -brief option to stop timestamps from being logged:
% pig -brief
Logging error messages to: /Users/tom/pig_1414246949680.log
Default bootup file /Users/tom/.pigbootup not found
Connecting to hadoop file system at: hdfs://localhost/
grunt>
As we can see from the output, Pig reports the filesystem (but not the YARN resource manager)
that it has connected to.
Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and MapReduce
mode:
Script
Pig can run a script file that contains Pig commands. For example, pig script.pig runs the
commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e
option to run a script specified as a string on the command line.
Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified
for Pig to run and the -e option is not used. It is also possible to run Pig scripts from within Grunt
using run and exec.
Embedded
You can run Pig programs from Java using the PigServer class, much like you can use JDBC to
run SQL programs from Java. For programmatic access to Grunt, use PigRunner.
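As a brief sketch of the first two methods (assuming the sample weather file used later in this chapter and a script called script.pig exist on the local filesystem), we could run a short script given as a string with the -e option, or run a script file from within Grunt with exec:
% pig -x local -e "A = LOAD 'input/ncdc/micro-tab/sample.txt'; DUMP A;"
% pig -x local
grunt> exec script.pig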
Grunt
Grunt has line-editing facilities; for instance, the Ctrl-E key combination will move the cursor to
the end of the line. Grunt remembers command history, too, and we can recall lines in the history
buffer using Ctrl-P or Ctrl-N (for previous and next), or equivalently, the up or down cursor
keys.
Another handy feature is Grunt’s completion mechanism, which will try to complete Pig Latin
keywords and functions when you press the Tab key. For example, consider the following
incomplete line:
grunt> a = foreach b ge
If we press the Tab key at this point, ge will expand to generate, a Pig Latin keyword:
grunt> a = foreach b generate
Pig Latin Editors
There are Pig Latin syntax highlighters available for a variety of editors, including Eclipse,
IntelliJ IDEA, Vim, Emacs, and TextMate. Details are available on the Pig wiki. Many Hadoop
distributions come with the Hue web interface, which has a Pig script editor and launcher.
An Example
Let’s look at a simple example by writing the program to calculate the maximum recorded
temperature by year for the weather dataset in Pig Latin (just like we did using MapReduce). The
complete program is only a few lines long:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
quality IN (0, 1, 4, 5, 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
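Assuming the script is saved as max_temp.pig in the current directory, it can be run in its entirety (in local mode, say) with:
% pig -x local max_temp.pig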
To explore what’s going on, we’ll use Pig’s Grunt interpreter, which allows us to enter lines and
interact with the program to understand what it’s doing. Start up Grunt in local mode, and then
enter the first line of the Pig script:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);
For simplicity, the program assumes that the input is tab-delimited text, with each line having
just year, temperature, and quality fields. The year:chararray notation describes the field’s name
and type; chararray is like a Java String, and an int is like a Java int. The LOAD operator takes a
URI argument; here we are just using a local file, but we could refer to an HDFS URI. The AS
clause (which is optional) gives the fields names to make it convenient to refer to them in
subsequent statements.
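For instance, if the sample file had been copied into HDFS, the same statement could refer to it with an HDFS URI (a sketch; the path assumes the file sits under the user’s HDFS home directory):
grunt> records = LOAD 'hdfs://localhost/user/tom/input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);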
The result of the LOAD operator, and indeed any operator in Pig Latin, is a relation, which is
just a set of tuples. A tuple is just like a row of data in a database table, with multiple fields in a
particular order. In this example, the LOAD function produces a set of (year, temperature,
quality) tuples that are present in the input file. We write a relation with one tuple per line, where
tuples are represented as comma-separated items in parentheses:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
Relations are given names, or aliases, so they can be referred to. This relation is given the
records alias. We can examine the contents of an alias using the DUMP operator:
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
We can also see the structure of a relation—the relation’s schema—using the DESCRIBE
operator on the relation’s alias:
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
This tells us that records has three fields, with aliases year, temperature, and quality, which are
the names we gave them in the AS clause. The fields have the types given to them in the AS
clause, too.
The second statement removes records that have a missing temperature (indicated by a value of
9999) or an unsatisfactory quality reading. For this small dataset, no records are filtered out:
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> quality IN (0, 1, 4, 5, 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
The third statement uses the GROUP function to group the filtered_records relation by the year field.
Let’s use DUMP to see what it produces:
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,78,1),(1949,111,1)})
(1950,{(1950,-11,1),(1950,22,1),(1950,0,1)})
We now have two rows, or tuples: one for each year in the input data. The first field in each tuple
is the field being grouped by (the year), and the second field has a bag of tuples for that year. A
bag is just an unordered collection of tuples, which in Pig Latin is represented using curly braces.
By grouping the data in this way, we have created a row per year, so now all that remains is to
find the maximum temperature for the tuples in each bag. Before we do this, let’s understand the
structure of the grouped_records relation:
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray,filtered_records: {year: chararray,
temperature: int,quality: int}}
This tells us that the grouping field is given the alias group by Pig, and the second field is the
same structure as the filtered_records relation that was being grouped. With this information, we
can try the fourth transformation:
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to
define the fields in each derived row. In this example, the first field is group, which is just the
year.
The second field is a little more complex: the filtered_records.temperature reference is to the
temperature field of the filtered_records bag
in the grouped_records relation. MAX is a built-in function for calculating the maximum value
of fields in a bag. In this case, it calculates the maximum temperature for the fields in each
filtered_records bag. Let’s check the result:
grunt> DUMP max_temp;
(1949,111)
(1950,22)
We’ve successfully calculated the maximum temperature for each year.
Generating Examples
In this example, we’ve used a small sample dataset with just a handful of rows to make
it easier to follow the data flow and aid debugging.
With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably complete
and concise sample dataset. For example:
grunt> ILLUSTRATE max_temp;
In its output, Pig uses some of the original data (this is important to keep the generated dataset
realistic), as well as creating some new data. It notices the special value 9999 in the query and
creates a tuple containing this value to exercise the FILTER statement.