© 2012 coreservlets.com and Dima May

Apache Pig
Originals of Slides and Source Code for Examples: http://www.coreservlets.com/hadoop-tutorial/
Customized Java EE Training: http://courses.coreservlets.com/
Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jQuery, Spring, Hibernate, RESTful Web Services, Android.

Developed and taught by well-known author and developer. At public venues or onsite at your location.


For live Hadoop training, please see courses at http://courses.coreservlets.com/.

Taught by the author of this Hadoop tutorial. Available at public venues, or customized versions can be held on-site at your organization.
Courses developed and taught by Marty Hall
  JSF 2, PrimeFaces, servlets/JSP, Ajax, jQuery, Android development, Java 6 or 7 programming, custom mix of topics
  Ajax courses can concentrate on 1 library (jQuery, Prototype/Scriptaculous, Ext-JS, Dojo, etc.) or survey several
Courses developed and taught by coreservlets.com experts (edited by Marty)
  Hadoop, Spring, Hibernate/JPA, GWT, SOAP-based and RESTful Web Services

Agenda
  Pig Overview
  Execution Modes
  Installation
  Pig Latin Basics
  Developing Pig Script
    Most Occurred Start Letter
  Resources

Pig
"is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs."
Top Level Apache Project
  http://pig.apache.org
Pig is an abstraction on top of Hadoop
  Provides high level programming language designed for data processing
  Converted into MapReduce and executed on Hadoop Clusters
Pig is widely accepted and used
  Yahoo!, Twitter, Netflix, etc.

Pig and MapReduce
MapReduce requires programmers
  Must think in terms of map and reduce functions
  More than likely will require Java programmers
Pig provides high-level language that can be used by
  Analysts
  Data Scientists
  Statisticians
  Etc.
Originally implemented at Yahoo! to allow analysts to access data

Pig's Features
  Join Datasets
  Sort Datasets
  Filter
  Data Types
  Group By
  User Defined Functions
  Etc.

Pig's Use Cases
Extract Transform Load (ETL)
  Ex: Processing large amounts of log data
    Clean bad entries, join with other data-sets
Research of raw information
  Ex: User Audit Logs
  Schema may be unknown or inconsistent
  Data Scientists and Analysts may like Pig's data transformation paradigm

Pig Components
Pig Latin
  Command based language
  Designed specifically for data transformation and flow expression
Execution Environment
  The environment in which Pig Latin commands are executed
  Currently there is support for Local and Hadoop modes
Pig compiler converts Pig Latin to MapReduce
  Compiler strives to optimize execution
  You automatically get optimization improvements with Pig updates

Execution Modes
Local
  Executes in a single JVM
  Works exclusively with local file system
  Great for development, experimentation and prototyping
Hadoop Mode
  Also known as MapReduce mode
  Pig renders Pig Latin into MapReduce jobs and executes them on the cluster
  Can execute against pseudo-distributed or fully-distributed Hadoop installation
    We will run on a pseudo-distributed cluster

Hadoop Mode
PigLatin.pig:
    -- 1: Load text into a bag, where a row is a line of text
    lines = LOAD '/training/playArea/hamlet.txt' AS (line:chararray);
    -- 2: Tokenize the provided text
    tokens = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS token:chararray;

[Diagram: PigLatin.pig is executed on the Hadoop cluster. Pig, acting as the Hadoop execution environment, parses the Pig script, compiles it into a set of MapReduce jobs, and monitors/reports on them as they run on the Hadoop Cluster.]

Installation Prerequisites
Java 6
  With $JAVA_HOME environment variable properly set
Cygwin on Windows

Installation
Add pig script to path
    export PIG_HOME=$CDH_HOME/pig-0.9.2-cdh4.0.0
    export PATH=$PATH:$PIG_HOME/bin

    $ pig -help
That's all we need to run in local mode
  Think of Pig as a Pig Latin compiler, development tool and executor
  Not tightly coupled with Hadoop clusters

Pig Installation for Hadoop Mode
Make sure Pig compiles with Hadoop
  Not a problem when using a distribution such as Cloudera Distribution for Hadoop (CDH)
Pig will utilize $HADOOP_HOME and $HADOOP_CONF_DIR variables to locate Hadoop configuration
  We already set these properties during MapReduce installation
  Pig will use these properties to locate Namenode and Resource Manager

Running Modes
Can manually override the default mode via -x or -exectype options
    $ pig -x local
    $ pig -x mapreduce

    $ pig
    2012-07-14 13:38:58,139 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/Training/play_area/pig/pig_1342287538128.log
    2012-07-14 13:38:58,458 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:8020

    $ pig -x local
    2012-07-14 13:39:31,029 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/Training/play_area/pig/pig_1342287571019.log
    2012-07-14 13:39:31,232 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///

Running Pig
Script
  Execute commands in a file
    $ pig scriptFile.pig
Grunt
  Interactive Shell for executing Pig Commands
  Started when script file is NOT provided
  Can execute scripts from Grunt via run or exec commands
Embedded
  Execute Pig commands using PigServer class
    Just like JDBC to execute SQL
  Can have programmatic access to Grunt via PigRunner class

Pig Latin Concepts
Building blocks
  Field: piece of data
  Tuple: ordered set of fields, represented with ( and )
    (10.4, 5, word, 4, field1)
  Bag: collection of tuples, represented with { and }
    { (10.4, 5, word, 4, field1), (this, 1, blah) }
Similar to Relational Database
  Bag is a table in the database
  Tuple is a row in a table
  Bags do not require that all tuples contain the same number of fields
    Unlike relational table
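The building blocks above can be pictured with ordinary Python values. This is only an illustrative model, not Pig's API: the names `row_a`, `row_b`, `bag`, and `field_counts` are made up for this sketch, and a Python list stands in for a bag.

```python
# Plain-Python model of Pig's building blocks (illustrative only).
# A tuple is an ordered set of fields; a bag is a collection of tuples.
row_a = (10.4, 5, "word", 4, "field1")
row_b = ("this", 1, "blah")

# Unlike rows in a relational table, tuples in one bag may differ in length.
bag = [row_a, row_b]


def field_counts(bag):
    """Return the number of fields in each tuple of the bag."""
    return [len(t) for t in bag]


print(field_counts(bag))
```

The two tuples have different field counts, which a relational table would not allow but a bag does.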

Simple Pig Latin Example

Start Grunt with default MapReduce mode:
    $ pig
Grunt supports file system commands:
    grunt> cat /training/playArea/pig/a.txt
    a 1
    d 4
    c 9
    k 6
Load contents of text files into a Bag named records:
    grunt> records = LOAD '/training/playArea/pig/a.txt' as (letter:chararray, count:int);
Display records bag to the screen:
    grunt> dump records;
    ...
    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
    2012-07-14 17:36:22,040 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
    ...
    (a,1)
    (d,4)
    (c,9)
    (k,6)
    grunt>
Results of the bag named records are printed to the screen.

DUMP and STORE statements
No action is taken until DUMP or STORE commands are encountered
  Pig will parse, validate and analyze statements but not execute them
DUMP displays the results to the screen
STORE saves results (typically to a file)

    records = LOAD '/training/playArea/pig/a.txt' as (letter:chararray, count:int);
    ...
    ...
    ...
    ...
    ...
    DUMP final_bag;

Nothing is executed for the statements before DUMP; Pig will optimize this entire chunk of script. The fun begins at DUMP.
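The deferred execution described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Pig is implemented: `LazyRelation`, `foreach`, `limit`, `dump`, and `double_count` are hypothetical names, and the "plan" is just a tuple of functions with no optimization step.

```python
class LazyRelation:
    """Records transformation steps and runs them only when dump() is called."""

    def __init__(self, source, steps=()):
        self.source = source        # any iterable of tuples
        self.steps = tuple(steps)   # row-list -> row-list functions, in order

    def foreach(self, fn):
        # Append a step to the plan; nothing is computed here.
        step = lambda rows: [fn(row) for row in rows]
        return LazyRelation(self.source, self.steps + (step,))

    def limit(self, n):
        step = lambda rows: rows[:n]
        return LazyRelation(self.source, self.steps + (step,))

    def dump(self):
        # Execution is triggered only here, like DUMP or STORE in Pig.
        rows = list(self.source)
        for step in self.steps:
            rows = step(rows)
        return rows


calls = []  # records every row the mapping function actually touches


def double_count(row):
    calls.append(row)
    letter, count = row
    return (letter, count * 2)


records = LazyRelation([("a", 1), ("d", 4), ("c", 9), ("k", 6)])
doubled = records.foreach(double_count)
touched_before_dump = len(calls)   # still 0: only the plan exists so far
result = doubled.dump()            # now the recorded steps run
```

Until `dump()` is called, `double_count` is never invoked, which mirrors how Pig only validates statements until it meets DUMP or STORE.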

Large Data
Hadoop data is usually quite large and it doesn't make sense to print it to the screen
The common pattern is to persist results to Hadoop (HDFS, HBase)
  This is done with STORE command
For information and debugging purposes you can print a small sub-set to the screen
    grunt> records = LOAD '/training/playArea/pig/excite-small.log' AS (userId:chararray, timestamp:long, query:chararray);
    grunt> toPrint = LIMIT records 5;
    grunt> DUMP toPrint;
Only 5 records will be displayed

LOAD Command
    LOAD 'data' [USING function] [AS schema];
data: name of the directory or file
  Must be in single quotes
USING: specifies the load function to use
  By default uses PigStorage which parses each line into fields using a delimiter
    Default delimiter is tab (\t)
    The delimiter can be customized by passing it to the load function, e.g. PigStorage(',')
AS: assign a schema to incoming data
  Assigns names to fields
  Declares types to fields
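To make the roles of the delimiter and the schema concrete, here is a minimal Python sketch of what a delimiter-based loader does conceptually. It is not PigStorage itself: the `load` function, the `CASTS` table, and the dictionary-per-row representation are assumptions for illustration, and unlike Pig this sketch raises an error on a value that cannot be cast.

```python
# Hypothetical mapping from Pig type names to Python casts.
CASTS = {
    "chararray": str,
    "int": int,
    "long": int,
    "float": float,
    "double": float,
}


def load(lines, schema, delimiter="\t"):
    """Split each line on the delimiter, then name and cast fields per schema.

    schema is a list of (field_name, type_name) pairs, mirroring
    AS (letter:chararray, count:int).
    """
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        row = {}
        for (name, type_name), raw in zip(schema, parts):
            row[name] = CASTS[type_name](raw)
        rows.append(row)
    return rows


schema = [("letter", "chararray"), ("count", "int")]
records = load(["a\t1", "d\t4", "c\t9", "k\t6"], schema)
print(records[0])
```

Passing `delimiter=","` plays the same role as handing a different delimiter to the load function in the USING clause.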

LOAD Command Example

    records = LOAD '/training/playArea/pig/excite-small.log'
        USING PigStorage()
        AS (userId:chararray, timestamp:long, query:chararray);

  Data: '/training/playArea/pig/excite-small.log'
  User selected Load Function: PigStorage(); there are a lot of choices or you can implement your own
  Schema: (userId:chararray, timestamp:long, query:chararray)

Schema Data Types

Type        Description                                  Example
Simple
int         Signed 32-bit integer                        10
long        Signed 64-bit integer                        10L or 10l
float       32-bit floating point                        10.5F or 10.5f
double      64-bit floating point                        10.5 or 10.5e2 or 10.5E2
Arrays
chararray   Character array (string) in Unicode UTF-8    hello world
bytearray   Byte array (blob)
Complex Data Types
tuple       An ordered set of fields                     (19,2)
bag         A collection of tuples                       {(19,2), (18,1)}
map         A set of key-value pairs                     [open#apache]

Source: Apache Pig Documentation 0.9.2; Pig Latin Basics. 2012

Pig Latin Diagnostic Tools
Display the structure of the Bag
    grunt> DESCRIBE <bag_name>;
Display Execution Plan
  Produces various reports
    Logical Plan
    MapReduce Plan
    grunt> EXPLAIN <bag_name>;
Illustrate how Pig engine transforms the data
    grunt> ILLUSTRATE <bag_name>;

Pig Latin - Grouping

    grunt> chars = LOAD '/training/playArea/pig/b.txt' AS (c:chararray);
    grunt> describe chars;
    chars: {c: chararray}
    grunt> dump chars;
    (a)
    (k)
    ...
    ...
    (k)
    (c)
    (k)
    grunt> charGroup = GROUP chars by c;
    grunt> describe charGroup;
    charGroup: {group: chararray,chars: {(c: chararray)}}
    grunt> dump charGroup;
    (a,{(a),(a),(a)})
    (c,{(c),(c)})
    (i,{(i),(i),(i)})
    (k,{(k),(k),(k),(k)})
    (l,{(l),(l)})

GROUP creates a new bag with an element named group and an element named chars.
The chars bag is grouped by c; therefore the group element will contain unique values.
The chars element is a bag itself and contains all tuples from the chars bag that match the value from c.
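The shape of a grouped bag is easier to see in a small model. The following Python sketch imitates the output structure of GROUP on an in-memory list; `group_by` is a hypothetical helper, and sorting the keys is an assumption made only so the output is deterministic.

```python
def group_by(bag, key_index=0):
    """Group tuples by one field, returning (group, inner_bag) pairs.

    Each result row has the unique key as its first element and a bag
    (here a list) of all original tuples with that key as its second.
    """
    groups = {}
    for t in bag:
        groups.setdefault(t[key_index], []).append(t)
    return [(key, groups[key]) for key in sorted(groups)]


chars = [("a",), ("k",), ("c",), ("k",), ("a",), ("c",), ("k",)]
char_group = group_by(chars)
for row in char_group:
    print(row)
```

Each output row pairs one unique value with an inner bag holding every matching tuple, the same layout as `charGroup: {group: chararray,chars: {(c: chararray)}}`.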

ILLUSTRATE Command

    grunt> chars = LOAD '/training/playArea/pig/b.txt' AS (c:chararray);
    grunt> charGroup = GROUP chars by c;
    grunt> ILLUSTRATE charGroup;
    ------------------------------
    | chars     | c:chararray    |
    ------------------------------
    |           | c              |
    |           | c              |
    ------------------------------
    -------------------------------------------------------------------
    | charGroup | group:chararray | chars:bag{:tuple(c:chararray)}   |
    -------------------------------------------------------------------
    |           | c               | {(c), (c)}                       |
    -------------------------------------------------------------------

Inner vs. Outer Bag

    grunt> chars = LOAD '/training/playArea/pig/b.txt' AS (c:chararray);
    grunt> charGroup = GROUP chars by c;
    grunt> ILLUSTRATE charGroup;
    ------------------------------
    | chars     | c:chararray    |
    ------------------------------
    |           | c              |
    |           | c              |
    ------------------------------
    -------------------------------------------------------------------
    | charGroup | group:chararray | chars:bag{:tuple(c:chararray)}   |
    -------------------------------------------------------------------
    |           | c               | {(c), (c)}                       |
    -------------------------------------------------------------------

  Outer Bag: charGroup
  Inner Bag: chars, e.g. {(c), (c)}

Inner vs. Outer Bag

    grunt> chars = LOAD '/training/playArea/pig/b.txt' AS (c:chararray);
    grunt> charGroup = GROUP chars by c;
    grunt> dump charGroup;
    (a,{(a),(a),(a)})
    (c,{(c),(c)})
    (i,{(i),(i),(i)})
    (k,{(k),(k),(k),(k)})
    (l,{(l),(l)})

  Inner Bag: the bag inside each row, e.g. {(a),(a),(a)}
  Outer Bag: the charGroup bag as a whole

Pig Latin - FOREACH

    FOREACH <bag> GENERATE <data>
Iterate over each element in the bag and produce a result
  Ex: grunt> result = FOREACH bag GENERATE f1;

    grunt> records = LOAD 'data/a.txt' AS (c:chararray, i:int);
    grunt> dump records;
    (a,1)
    (d,4)
    (c,9)
    (k,6)
    grunt> counts = foreach records generate i;
    grunt> dump counts;
    (1)
    (4)
    (9)
    (6)

For each row emit the 'i' field

FOREACH with Functions

    FOREACH B GENERATE group, FUNCTION(A);
Pig comes with many functions including COUNT, TOKENIZE, CONCAT, etc.
Can implement a custom function

    grunt> chars = LOAD 'data/b.txt' AS (c:chararray);
    grunt> charGroup = GROUP chars by c;
    grunt> dump charGroup;
    (a,{(a),(a),(a)})
    (c,{(c),(c)})
    (i,{(i),(i),(i)})
    (k,{(k),(k),(k),(k)})
    (l,{(l),(l)})
    grunt> describe charGroup;
    charGroup: {group: chararray,chars: {(c: chararray)}}
    grunt> counts = FOREACH charGroup GENERATE group, COUNT(chars);
    grunt> dump counts;
    (a,3)
    (c,2)
    (i,3)
    (k,4)
    (l,2)

For each row in the charGroup bag, emit the group field and count the number of items in the chars bag
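As a rough Python analogue of applying COUNT inside FOREACH, the sketch below walks the grouped rows and emits the group key with the size of its inner bag. The helper name `count_per_group` and the list-based bags are assumptions for illustration; the data is the charGroup output shown above.

```python
def count_per_group(grouped):
    """For each (group, inner_bag) row, emit (group, number of tuples in the bag)."""
    return [(group, len(inner_bag)) for group, inner_bag in grouped]


# The charGroup rows from the Grunt session, modeled as Python values.
char_group = [
    ("a", [("a",), ("a",), ("a",)]),
    ("c", [("c",), ("c",)]),
    ("i", [("i",), ("i",), ("i",)]),
    ("k", [("k",), ("k",), ("k",), ("k",)]),
    ("l", [("l",), ("l",)]),
]

counts = count_per_group(char_group)
print(counts)
```

The function receives the whole inner bag for a row, which is why the Pig statement passes the bag field (`chars`) to COUNT rather than a field inside it.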

TOKENIZE Function
Splits a string into tokens and outputs as a bag of tokens
  Separators are: space, double quote ("), comma (,), parenthesis (()), star (*)

    grunt> linesOfText = LOAD 'data/c.txt' AS (line:chararray);
    grunt> dump linesOfText;
    (this is a line of text)
    (yet another line of text)
    (third line of words)
    grunt> tokenBag = FOREACH linesOfText GENERATE TOKENIZE(line);
    grunt> dump tokenBag;
    ({(this),(is),(a),(line),(of),(text)})
    ({(yet),(another),(line),(of),(text)})
    ({(third),(line),(of),(words)})
    grunt> describe tokenBag;
    tokenBag: {bag_of_tokenTuples: {tuple_of_tokens: (token: chararray)}}

TOKENIZE splits each row's line by space and returns a bag of tokens
Each row of tokenBag is a bag of words produced by the TOKENIZE function
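The splitting rule can be approximated in Python with a regular expression over the separators listed on this slide. This is a sketch of the behavior as described here, not Pig's implementation; `tokenize` and `SEPARATORS` are hypothetical names, and dropping empty tokens is an assumption.

```python
import re

# Separators named on the slide: space, double quote, comma, parentheses, star.
SEPARATORS = re.compile(r'[ ",()*]+')


def tokenize(line):
    """Split a line on the separators and return a bag (list) of 1-field tuples."""
    return [(token,) for token in SEPARATORS.split(line) if token]


print(tokenize("this is a line of text"))
print(tokenize('(yet, another)*"line"'))
```

Each token is wrapped in its own one-field tuple, matching the `{(this),(is),...}` shape in the dump output.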

FLATTEN Operator
Flattens nested bags and data types
FLATTEN is not a function, it's an operator
  Re-arranges output

    grunt> dump tokenBag;
    ({(this),(is),(a),(line),(of),(text)})
    ({(yet),(another),(line),(of),(text)})
    ({(third),(line),(of),(words)})
    grunt> flatBag = FOREACH tokenBag GENERATE flatten($0);
    grunt> dump flatBag;
    (this)
    (is)
    (a)
    ...
    ...
    (text)
    (third)
    (line)
    (of)
    (words)

tokenBag is a nested structure: a bag of bags of tuples
Each row is flattened, resulting in a bag of simple tokens
Elements in a bag can be referenced by index ($0)
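A small Python sketch may help show what flattening a bag does to the row count. It assumes each input row holds exactly one field, a bag modeled as a list of tuples; `flatten_bags` is a name invented for this example and covers only that one-bag-per-row case.

```python
def flatten_bags(rows):
    """For rows shaped (bag,), emit one output row per tuple inside the bag."""
    flat = []
    for (bag,) in rows:
        flat.extend(bag)
    return flat


token_bag = [
    ([("this",), ("is",), ("a",), ("line",), ("of",), ("text",)],),
    ([("third",), ("line",), ("of",), ("words",)],),
]

flat_bag = flatten_bags(token_bag)
print(len(token_bag), "rows in,", len(flat_bag), "rows out")
```

Two nested rows become ten flat rows, one per token, which is the re-arrangement of output the slide refers to.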

Conventions and Case Sensitivity
Case Sensitive
  Alias names
  Pig Latin Functions
Case Insensitive
  Pig Latin Keywords

    counts = FOREACH charGroup GENERATE group, COUNT(chars);

  counts, charGroup, chars: aliases, case sensitive
  COUNT: function, case sensitive
  FOREACH, GENERATE: keywords, case insensitive

General conventions
  Upper case is a system keyword
  Lowercase is something that you provide

Problem: Locate Most Occurred Start Letter
Calculate number of occurrences of each start letter in the provided body of text
Traverse each letter comparing occurrence count
Produce start letter that has the most occurrences

    (For so this side of our known world esteem'd him)
    Did slay this Fortinbras; who, by a seal'd compact,
    Well ratified by law and heraldry,
    Did forfeit, with his life, all those his lands
    Which he stood seiz'd of, to the conqueror;
    Against the which a moiety competent
    Was gaged by our king; which had return'd
    To the inheritance of Fortinbras,

[Figure: the text is reduced to a per-letter count table (A: 89530, B: 3920, ..., Z: 876), and the start letter with the most occurrences is produced (T: 495959).]

Most Occurred Start Letter: Pig Way
1. Load text into a bag (named 'lines')
2. Tokenize the text in the 'lines' bag
3. Retain first letter of each token
4. Group by letter
5. Count the number of occurrences in each group
6. Descending order the group by the count
7. Grab the first element => Most occurring letter
8. Persist result on a file system

1: Load Text Into a Bag

    grunt> lines = LOAD '/training/data/hamlet.txt' AS (line:chararray);
Load text file into a bag, stick entire line into element 'line' of type chararray

INSPECT lines bag:

    grunt> describe lines;
    lines: {line: chararray}
    grunt> toDisplay = LIMIT lines 5;
    grunt> dump toDisplay;
    (This Etext file is presented by Project Gutenberg, in)
    (This etext is a typo-corrected version of Shakespeare's Hamlet,)
    (cooperation with World Library, Inc., from their Library of the)
    (*This Etext has certain copyright implications you should read!*
    (Future and Shakespeare CDROMS. Project Gutenberg often releases
Each row is a line of text

2: Tokenize the Text in the Lines Bag

    grunt> tokens = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS token:chararray;
For each line of text (1) tokenize that line (2) flatten the structure to produce 1 word per row

INSPECT tokens bag:

    grunt> describe tokens;
    tokens: {token: chararray}
    grunt> toDisplay = LIMIT tokens 5;
    grunt> dump toDisplay;
    (a)
    (is)
    (of)
    (This)
    (etext)
Each row is now a token

3: Retain First Letter of Each Token

    grunt> letters = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter:chararray;
For each token grab the first letter; utilize SUBSTRING function

INSPECT letters bag:

    grunt> describe letters;
    letters: {letter: chararray}
    grunt> toDisplay = LIMIT letters 5;
    grunt> dump toDisplay;
    (a)
    (i)
    (T)
    (e)
    (t)
What we have now is 1 character per row
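The arguments 0 and 1 in SUBSTRING(token,0,1) are a start index and a stop index, where the stop index is the position just after the last character kept. Python slices use the same convention, so this step can be sketched as follows; `first_letters` is a hypothetical helper and the token list is the five sample tokens shown in step 2.

```python
def first_letters(tokens):
    """Keep characters from index 0 up to, but not including, index 1."""
    return [token[0:1] for token in tokens]


print(first_letters(["a", "is", "of", "This", "etext"]))
```

Case is preserved, so "This" contributes an upper-case "T" that is counted separately from lower-case "t" in the later steps.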

4: Group by Letter

    grunt> letterGroup = GROUP letters BY letter;
Create a bag for each unique character; the "grouped" bag will contain the same character for each occurrence of that character

INSPECT letterGroup bag:

    grunt> describe letterGroup;
    letterGroup: {group: chararray,letters: {(letter: chararray)}}
    grunt> toDisplay = LIMIT letterGroup 5;
    grunt> dump toDisplay;
    (0,{(0),(0),(0)})
    (a,{(a),(a)})
    (2,{(2),(2),(2),(2),(2)})
    (3,{(3),(3),(3)})
    (b,{(b)})
Next we'll need to convert character occurrences into counts
Note: this display was modified as there were too many characters to fit on the screen

5: Count the Number of Occurrences in Each Group

    grunt> countPerLetter = FOREACH letterGroup GENERATE group, COUNT(letters);
For each row, count occurrence of the letter

INSPECT countPerLetter bag:

    grunt> describe countPerLetter;
    countPerLetter: {group: chararray,long}
    grunt> toDisplay = LIMIT countPerLetter 5;
    grunt> dump toDisplay;
    (A,728)
    (B,325)
    (C,291)
    (D,194)
    (E,264)
Each row now has the character and the number of times it was found to start a word. All we have to do is find the maximum

6: Descending Order the Group by the Count

    grunt> orderedCountPerLetter = ORDER countPerLetter BY $1 DESC;
Simply order the bag by the second field ($1), the number of occurrences for that element

INSPECT orderedCountPerLetter bag:

    grunt> describe orderedCountPerLetter;
    orderedCountPerLetter: {group: chararray,long}
    grunt> toDisplay = LIMIT orderedCountPerLetter 5;
    grunt> dump toDisplay;
    (t,3711)
    (a,2379)
    (s,1938)
    (m,1787)
    (h,1725)
All we have to do now is just grab the first element

7: Grab the First Element

    grunt> result = LIMIT orderedCountPerLetter 1;
The rows were already ordered in descending order, so simply limiting to one element gives us the result

INSPECT result bag:

    grunt> describe result;
    result: {group: chararray,long}
    grunt> dump result;
    (t,3711)
There it is
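Steps 6 and 7 together amount to "sort by the count, largest first, then keep one row". A minimal Python sketch of that idea, using the five rows from the step 6 output in shuffled order; the helper name `order_desc_limit` and its parameters are made up for this example.

```python
def order_desc_limit(rows, field_index=1, n=1):
    """Sort rows by one positional field, largest first, and keep the first n."""
    ordered = sorted(rows, key=lambda row: row[field_index], reverse=True)
    return ordered[:n]


count_per_letter = [("m", 1787), ("t", 3711), ("h", 1725), ("a", 2379), ("s", 1938)]
print(order_desc_limit(count_per_letter))
```

`field_index=1` plays the role of `$1`: positions are zero-based, so it selects the count rather than the letter.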

8: Persist Result on a File System

    grunt> STORE result INTO '/training/playArea/pig/mostSeenLetterOutput';
Result is saved under the provided directory

INSPECT result:

    $ hdfs dfs -cat /training/playArea/pig/mostSeenLetterOutput/part-r-00000
    t 3711

Notice that the result was stored in part-r-00000, the regular artifact of a MapReduce reducer; Pig compiles Pig Latin into MapReduce code and executes it.

MostSeenStartLetter.pig Script

    -- 1: Load text into a bag, where a row is a line of text
    lines = LOAD '/training/data/hamlet.txt' AS (line:chararray);
    -- 2: Tokenize the provided text
    tokens = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS token:chararray;
    -- 3: Retain first letter of each token
    letters = FOREACH tokens GENERATE SUBSTRING(token,0,1) AS letter:chararray;
    -- 4: Group by letter
    letterGroup = GROUP letters BY letter;
    -- 5: Count the number of occurrences in each group
    countPerLetter = FOREACH letterGroup GENERATE group, COUNT(letters);
    -- 6: Descending order the group by the count
    orderedCountPerLetter = ORDER countPerLetter BY $1 DESC;
    -- 7: Grab the first element => Most occurring letter
    result = LIMIT orderedCountPerLetter 1;
    -- 8: Persist result on a file system
    STORE result INTO '/training/playArea/pig/mostSeenLetterOutput';

Execute the script:

    $ pig MostSeenStartLetter.pig
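For comparison, the same data flow can be written as a single-process Python function. This is only a sketch for checking the logic on a small sample, not a replacement for the Pig script: `most_seen_start_letter` and `SEPARATORS` are hypothetical names, the separator set is the one listed on the TOKENIZE slide, nothing is distributed, and the input is assumed to contain at least one token.

```python
import re
from collections import Counter

# Separators as listed on the TOKENIZE slide.
SEPARATORS = re.compile(r'[ ",()*]+')


def most_seen_start_letter(lines):
    """Return (letter, count) for the most frequent first letter of a token."""
    # Steps 1-2: read lines and tokenize them, one token per item.
    tokens = (tok for line in lines for tok in SEPARATORS.split(line) if tok)
    # Step 3: retain the first letter of each token (case is preserved).
    letters = (tok[0:1] for tok in tokens)
    # Steps 4-5: group by letter and count occurrences.
    counts = Counter(letters)
    # Steps 6-7: order by count descending and take the first element.
    return counts.most_common(1)[0]


sample = ["to be or not to be", "that is the question"]
print(most_seen_start_letter(sample))
```

On the two sample lines the start letters are t, b, o, n, t, b, t, i, t, q, so "t" wins with four occurrences.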

Pig Tools
Community has developed several tools to support Pig
  https://cwiki.apache.org/confluence/display/PIG/PigTools
We have PigPen Eclipse Plugin installed:
  Download the latest jar release at https://issues.apache.org/jira/browse/PIG-366
    As of writing: org.apache.pig.pigpen_0.7.5.jar
  Place jar in eclipse/plugins/
  Restart Eclipse

Pig Resources
Apache Pig Documentation
  http://pig.apache.org
Programming Pig
  Alan Gates (Author)
  O'Reilly Media; 1st Edition (October, 2011)
Chapter About Pig: Hadoop: The Definitive Guide
  Tom White (Author)
  O'Reilly Media; 3rd Edition (May 6, 2012)
Chapter About Pig: Hadoop in Action
  Chuck Lam (Author)
  Manning Publications; 1st Edition (December, 2010)
Hadoop in Practice
  Alex Holmes (Author)
  Manning Publications; (October 10, 2012)


Wrap-Up

Summary
We learned about
  Pig Overview
  Execution Modes
  Installation
  Pig Latin Basics
  Resources
We developed Pig Script to locate "Most Occurred Start Letter"


Questions?
