Apache Pig [7] is a scripting platform for processing and analyzing large data sets. It allows Apache Hadoop users to write complex MapReduce transformations using a simple scripting language called Pig Latin. Pig translates the Pig Latin script into MapReduce so that it can be executed within YARN for access to a single dataset stored in the Hadoop Distributed File System (HDFS).
The Pig Latin statements in the Pig script (id*.pig) [8] extract all user IDs from the passwd file. First, copy these files from /home/sxg125/hadoop-projects/pig to your home directory. Next, run the Pig script (using local or mapreduce mode). The STORE operator will write the results to a file (id.out).
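For reference, the core logic is just three Pig Latin statements. A minimal sketch of the script (assuming it stores its results into id.out, as noted above; the interactive and MapReduce variants below differ only in their paths):

A = load 'passwd' using PigStorage(':');  -- split each line of passwd on ':'
B = foreach A generate $0 as id;          -- keep only the first field (the user ID)
store B into 'id.out';                    -- write the IDs to id.out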
Run the pig command in local mode; you will get the grunt prompt:
pig -x local
Copy and paste each line from the file "id-interactive.pig". Make sure that the input file 'passwd' is in the current directory.
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> store B into 'output';
output:
...
Input(s):
Successfully read records from: "file:///home/sxg125/hadoop-projects/pig/passwd"
Output(s):
Successfully stored records in: "file:///home/sxg125/hadoop-projects/pig/output"
Exit from grunt:
quit
See the output in the output directory:
cat output/part-m-00000
output:
root
bin
daemon
adm
...
Create the project and input directories in HDFS, replacing the paths below with your own. There is no need to pre-create the output directory: Pig creates it when the STORE runs, and fails if it already exists.
hadoop fs -mkdir /user/sxg125/projects/pig
hadoop fs -mkdir /user/sxg125/projects/pig/input
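Alternatively, on Hadoop 2 and later the two commands can be combined, since the -p flag creates missing parent directories:

hadoop fs -mkdir -p /user/sxg125/projects/pig/input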
Copy the input file "passwd" to the input directory
hadoop fs -put passwd /user/sxg125/projects/pig/input
Modify the input and output paths in the script "id-mpareduce.pig" to match the HDFS input and output paths you created:
A = load '/user/sxg125/projects/pig/input/passwd' using PigStorage(':'); -- load the passwd file
B = foreach A generate $0 as id; -- extract the user IDs
store B into '/user/sxg125/projects/pig/output'; -- write the results to the output directory
Execute:
pig -x mapreduce id-mpareduce.pig
Check the output:
hadoop fs -cat /user/sxg125/projects/pig/output/part*
output:
root
bin
daemon
adm
...
To build the user-defined function (UDF), first create a directory myudfs:
mkdir myudfs
Copy the file UPPER.java from "/home/sxg125/hadoop-projects/pig/udf" to myudfs. This Java code converts the first field to an uppercase string.
cp /home/sxg125/hadoop-projects/pig/udf/myudfs/UPPER.java myudfs
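For reference, a minimal sketch of what UPPER.java likely contains, following the canonical EvalFunc example in the Pig documentation (the file on the cluster may differ in details):

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A simple eval UDF: returns the first field of the input tuple, uppercased.
public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;                        // nothing to convert
        try {
            String str = (String) input.get(0); // first field of the tuple
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}

Note the package declaration: it must match the myudfs directory so the class resolves as myudfs.UPPER once jarred.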
Compile the Java code, with the Pig jar and the Hadoop classpath on the compile classpath:
javac -cp /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/lib/pig/pig.jar:`hadoop classpath` UPPER.java
Move up one directory (cd ..) and create a jar file containing the package directory:
jar -cf myudfs.jar myudfs
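Optionally, confirm that the compiled class made it into the jar:

jar -tf myudfs.jar

The listing should include myudfs/UPPER.class.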
Copy the input file "student_data" to HDFS input directory. Modify the path according to yours.
hadoop fs -put student_data /user/sxg125/projects/pig/input
Copy the Pig script "pig-udf-mapreduce.pig" from "/home/sxg125/hadoop-projects/pig/udf" to the current directory. Modify the input and output paths in the script to match yours.
cp /home/sxg125/hadoop-projects/pig/udf/myudfs/pig-udf-mapreduce.pig .
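For reference, the script likely resembles the following sketch, assuming student_data is comma-separated with fields name, age, and gpa (as the sample output below suggests):

REGISTER myudfs.jar;  -- make the UDF jar available to Pig
A = load '/user/sxg125/projects/pig/input/student_data' using PigStorage(',') as (name: chararray, age: int, gpa: float);
B = foreach A generate myudfs.UPPER(name), age, gpa;  -- uppercase the name field with the UDF
store B into '/user/sxg125/projects/pig/output' using PigStorage(',');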
Run the script:
pig -x mapreduce pig-udf-mapreduce.pig
See the output:
hadoop fs -cat /user/sxg125/projects/pig/output/part*
output:
SANJAYA,20,3.2
MEKA,16,4.0