Apache Pig [7] is a scripting platform for processing and analyzing large data sets. It allows Apache Hadoop users to write complex MapReduce transformations using a simple scripting language called Pig Latin. Pig translates the Pig Latin script into MapReduce so that it can be executed within YARN for access to a single dataset stored in the Hadoop Distributed File System (HDFS).
The Pig Latin statements in the Pig script (id*.pig) [8] extract all user IDs from the passwd file. First, copy these files from /home/sxg125/hadoop-projects/pig to your home directory. Next, run the Pig script (using local or mapreduce mode). The STORE operator will write the results to a file (id.out).
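For reference, the core logic is just three Pig Latin statements. A minimal sketch of the script (assuming it stores its results into id.out, as noted above; the interactive and MapReduce variants below differ only in their paths):

A = load 'passwd' using PigStorage(':');  -- split each line of passwd on ':'
B = foreach A generate $0 as id;          -- keep only the first field (the user ID)
store B into 'id.out';                    -- write the IDs to id.out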
Run the pig command in local mode; you will get the grunt prompt:
pig -x local
Copy and paste each line from the file "id-interactive.pig". Make sure that the input file 'passwd' is in the current directory.
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> store B into 'output';
output:
...
Input(s):
Successfully read records from: "file:///home/sxg125/hadoop-projects/pig/passwd"
Output(s):
Successfully stored records in: "file:///home/sxg125/hadoop-projects/pig/output"
Exit from grunt:
quit
See the output in the output directory:
cat output/part-m-00000
output:
root
bin
daemon
adm
...
Create the project and input directories in HDFS, replacing the paths below with your own. There is no need to pre-create the output directory: Pig creates it when the STORE runs, and fails if it already exists.
hadoop fs -mkdir /user/sxg125/projects/pig
hadoop fs -mkdir /user/sxg125/projects/pig/input
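Alternatively, on Hadoop 2 and later the two commands can be combined, since the -p flag creates missing parent directories:

hadoop fs -mkdir -p /user/sxg125/projects/pig/input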
Copy the input file "passwd" to the input directory
hadoop fs -put passwd /user/sxg125/projects/pig/input
Modify the input and output paths in the script "id-mpareduce.pig" to match the HDFS input and output paths you created:
A = load '/user/sxg125/projects/pig/input/passwd' using PigStorage(':'); -- load the passwd file
B = foreach A generate $0 as id; -- extract the user IDs
store B into '/user/sxg125/projects/pig/output'; -- write the results to the output directory
Execute:
pig -x mapreduce id-mpareduce.pig
Check the output:
hadoop fs -cat /user/sxg125/projects/pig/output/part*
output:
root
bin
daemon
adm
...
To build the user-defined function (UDF), first create a directory myudfs:
mkdir myudfs
Copy the file UPPER.java from "/home/sxg125/hadoop-projects/pig/udf" to myudfs. This Java code converts the first field to an uppercase string.
cp /home/sxg125/hadoop-projects/pig/udf/myudfs/UPPER.java myudfs
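For reference, a minimal sketch of what UPPER.java likely contains, following the canonical EvalFunc example in the Pig documentation (the file on the cluster may differ in details):

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A simple eval UDF: returns the first field of the input tuple, uppercased.
public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;                        // nothing to convert
        try {
            String str = (String) input.get(0); // first field of the tuple
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}

Note the package declaration: it must match the myudfs directory so the class resolves as myudfs.UPPER once jarred.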
Compile the Java code, with the Pig jar and the Hadoop classpath on the compile classpath:
javac -cp /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/pig/pig.jar:`hadoop classpath` UPPER.java
Move up one directory (cd ..) and create a jar file containing the package directory:
jar -cf myudfs.jar myudfs
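Optionally, confirm that the compiled class made it into the jar:

jar -tf myudfs.jar

The listing should include myudfs/UPPER.class.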
Copy the input file "student_data" to HDFS input directory. Modify the path according to yours.
hadoop fs -put student_data /user/sxg125/projects/pig/input
Copy the Pig script "pig-udf-mapreduce.pig" from "/home/sxg125/hadoop-projects/pig/udf" to the current directory. Modify the input and output paths in the script to match yours.
cp /home/sxg125/hadoop-projects/pig/udf/myudfs/pig-udf-mapreduce.pig .
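For reference, the script likely resembles the following sketch, assuming student_data is comma-separated with fields name, age, and gpa (as the sample output below suggests):

REGISTER myudfs.jar;  -- make the UDF jar available to Pig
A = load '/user/sxg125/projects/pig/input/student_data' using PigStorage(',') as (name: chararray, age: int, gpa: float);
B = foreach A generate myudfs.UPPER(name), age, gpa;  -- uppercase the name field with the UDF
store B into '/user/sxg125/projects/pig/output' using PigStorage(',');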
Run the script:
pig -x mapreduce pig-udf-mapreduce.pig
See the output:
hadoop fs -cat /user/sxg125/projects/pig/output/part*
output:
SANJAYA,20,3.2
MEKA,16,4.0