Apache Pig tutorials

Installation

Download

 

Untar

tar xvfz pig-0.16.0.tar.gz

Move

mv pig-0.16.0/ ~/

Environment Variables

Copy the following lines into the /home/amar/sourceme file, then run source /home/amar/sourceme to load them into the current shell.

export PIG_HOME=/home/amar/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf

------------------------------------------------------------------

Data Set

Save the following records as student_data.txt and copy the file to /user in HDFS.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

Load Data



student = LOAD '/user/student_data.txt' 
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, 
   city:chararray );

Store Data



STORE student INTO '/gitam/arm' USING PigStorage(',');

Inspect the output directory from the shell:

hadoop fs -ls /gitam/arm
hadoop fs -cat /gitam/arm/part-m-00000


Explain and Illustrate

The explain operator displays the logical, physical, and MapReduce execution plans of a relation.

grunt> explain relation_name;

The describe operator displays the schema of a relation.

grunt> describe relation_name;

The illustrate operator shows the step-by-step execution of a sequence of statements on a small sample of the data.

grunt> illustrate relation_name;
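
As a concrete illustration, the three operators can be applied to the student relation from the Load Data step. This is only a sketch: it assumes student_data.txt is already at /user in HDFS, and the schema line shown in the comment is the general shape of DESCRIBE output rather than a captured log.

```pig
-- Assumes student_data.txt has been copied to /user in HDFS (see Load Data).
student = LOAD '/user/student_data.txt'
   USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

-- Print the schema; the output should look roughly like:
-- student: {id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray}
DESCRIBE student;

-- Print the logical, physical, and MapReduce plans without running the job.
EXPLAIN student;

-- Run the statements on a small sample of the input and show the rows at each step.
ILLUSTRATE student;
```

DESCRIBE is the cheapest of the three and is a quick way to confirm field names and types before writing further statements.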

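New Data Set

Save the following records as student_details.txt and copy the file to /user in HDFS. This version adds an age column.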

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

Group Data

The GROUP operator collects all tuples that share the same key into one bag per key.

student_details = LOAD '/user/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
group_data = GROUP student_details BY age;
DUMP group_data;
DESCRIBE group_data;

To group by more than one column, list the keys in parentheses:

group_multiple = GROUP student_details BY (age, city);
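
A grouped relation is usually fed into an aggregate. The sketch below counts the students in each age group; the names age_counts and num_students are illustrative and not part of the original notes, and the file is assumed to be at /user in HDFS.

```pig
-- Assumes student_details.txt is available at /user in HDFS.
student_details = LOAD '/user/student_details.txt' USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

group_data = GROUP student_details BY age;

-- 'group' holds the key (age); the bag is named after the grouped relation.
-- age_counts and num_students are hypothetical names chosen for this example.
age_counts = FOREACH group_data GENERATE group AS age, COUNT(student_details) AS num_students;

-- In the eight sample records every age (21 to 24) occurs twice,
-- so each num_students value should be 2.
DUMP age_counts;
```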







customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00




orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060



grunt> customers = LOAD '/user/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:double);
  
grunt> orders = LOAD '/user/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Inner Join

grunt> customer_orders = JOIN customers BY id, orders BY customer_id;

Left Outer Join

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

Right Outer Join

grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;

Full Outer Join

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
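
A joined relation carries the fields of both inputs, so a follow-up FOREACH is often used to keep only the columns of interest. The following is a minimal sketch for the inner join; order_summary is an illustrative name, and both files are assumed to be at /user in HDFS.

```pig
-- Assumes customers.txt and orders.txt are available at /user in HDFS.
-- salary is read as double here because the sample values carry decimals.
customers = LOAD '/user/customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray, age:int, address:chararray, salary:double);
orders = LOAD '/user/orders.txt' USING PigStorage(',')
   AS (oid:int, date:chararray, customer_id:int, amount:int);

customer_orders = JOIN customers BY id, orders BY customer_id;

-- After a join, fields can be referenced as relation::field.
-- order_summary is a hypothetical name chosen for this example.
order_summary = FOREACH customer_orders
   GENERATE customers::name AS name, orders::oid AS oid, orders::amount AS amount;

-- Only customers 2, 3, and 4 have orders in the sample data, so only
-- Khilan, kaushik (two orders), and Chaitali should appear.
DUMP order_summary;
```

With the outer joins, the fields from the side that has no match come back as nulls instead of the row being dropped.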

Union

The UNION operator merges the contents of two relations. Pig does not check that the schemas match, so for a meaningful result both relations should have the same columns, in the same order and with the same types.

grunt> student = UNION student1, student2;
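
The relations student1 and student2 are not loaded anywhere in these notes. One possible setup is sketched below; the two input paths are hypothetical and both files are assumed to follow the student_data.txt layout.

```pig
-- Hypothetical input files; both are assumed to use the student_data.txt layout.
student1 = LOAD '/user/student_data1.txt' USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
student2 = LOAD '/user/student_data2.txt' USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

-- UNION keeps duplicates and does not preserve tuple order.
student = UNION student1, student2;

-- Apply DISTINCT if records present in both inputs should appear only once.
student_unique = DISTINCT student;
DUMP student_unique;
```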

Split

The SPLIT operator splits a relation into two or more relations based on conditions.

SPLIT student_details INTO student_details1 IF age < 23, student_details2 IF (age > 22 AND age < 25);
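
When the conditions are meant to cover the whole relation, the OTHERWISE keyword (available since Pig 0.10) avoids writing the complementary condition by hand. The sketch below is a variation on the statement above, not a replacement for it; the relation names younger and older are illustrative.

```pig
-- Assumes student_details.txt is available at /user in HDFS.
student_details = LOAD '/user/student_details.txt' USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

-- Records with age below 23 go to 'younger'; the rest of the sample records go to 'older'.
-- younger and older are hypothetical names chosen for this example.
SPLIT student_details INTO younger IF age < 23, older OTHERWISE;

DUMP younger;   -- ids 1 to 4 in the sample data (ages 21 and 22)
DUMP older;     -- ids 5 to 8 in the sample data (ages 23 and 24)
```

Without OTHERWISE, a tuple that satisfies none of the conditions is dropped, and a tuple that satisfies several is copied into each matching relation.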
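
Foreach

The FOREACH operator applies an expression to every tuple of a relation; here it keeps only three columns.

grunt> foreach_data = FOREACH student_details GENERATE id, age, city;

Distinct, Order By, and Limit

DISTINCT removes duplicate tuples, ORDER BY sorts a relation, and LIMIT returns at most the given number of tuples.

grunt> distinct_data = DISTINCT student_details;
grunt> order_by_data = ORDER student_details BY age DESC;
grunt> limit_data = LIMIT student_details 4;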

Data

https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-08/

To get the 4 most requested pages of the English project (en):

records = LOAD '/user/pagecounts-20160801-000000' USING PigStorage(' ')
   as (projectName:chararray, pageName:chararray, pageCount:int, pageSize:long);

filtered_records = FILTER records BY projectName == 'en';
grouped_records = GROUP filtered_records BY pageName;
results = FOREACH grouped_records GENERATE group, SUM(filtered_records.pageCount);
sorted_result = ORDER results BY $1 DESC;
limit_data = LIMIT sorted_result 4;
DUMP limit_data;