Getting started:
Prerequisites: Java and HDFS.
Download Pig: pig-0.15.0.tar.gz
Extract it and add its path to the environment:
tar -xzf pig-0.15.0.tar.gz
#for centos
vim /etc/bashrc
#for ubuntu
vim /etc/bash.bashrc
#append the following lines to whichever of the above files applies.
export PIG_HOME=<ABSOLUTE_PATH_OF_EXTRACTED_TAR>
export PATH=$PATH:${PIG_HOME}/bin
Reload your shell and test Pig:
pig -help
Set up Eclipse for Pig script development: downloads and guidelines
First script: read the passwd file (':'-separated values) and print the id (field 0 of each line).
From bash, run pig to enter Grunt (Pig's interactive shell).
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
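For readers who want to check the logic without a Pig install, the two statements above can be sketched in plain Python; the sample lines below are illustrative, not taken from a real passwd file:

```python
# Sketch of what the Pig script does: split each ':'-separated
# line and keep field 0 (the user id). Sample data is made up.
sample_lines = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
]

# load 'passwd' using PigStorage(':')  -> split each line on ':'
rows = [line.split(":") for line in sample_lines]

# foreach A generate $0 as id          -> take field 0 of each row
ids = [row[0] for row in rows]

print(ids)  # -> ['root', 'daemon']
```

dump B prints the same projection, one tuple per input line.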
The script can be placed on the local file system, AWS S3, or HDFS and run from there.
Local script file:
pig -f /tmp/file_reader.pig
Script file on HDFS:
pig hdfs://localhost:9000/user/root/file_reader.pig
Running Pig away from Hadoop
To connect to a remote cluster, specify the cluster parameters in the pig.properties file. Example for MRv1:
fs.default.name=hdfs://namenode_address:8020/
mapred.job.tracker=jobtracker_address:8021
You can also specify the remote cluster addresses on the command line:
pig -fs namenode_address:8020 -jt jobtracker_address:8021