Getting started:
Prerequisites: Java and HDFS.
Download Pig: pig-0.15.0.tar.gz
Extract it and add its path to the environment:
tar -xzf pig-0.15.0.tar.gz
#for centos
vim /etc/bashrc
#for ubuntu
vim /etc/bash.bashrc
#append the following lines to whichever of the above files applies.
export PIG_HOME=<ABSOLUTE_PATH_OF_EXTRACTED_TAR>
export PATH=$PATH:${PIG_HOME}/bin
Reload your shell and test Pig:
pig -help
Set up Eclipse for Pig script development: downloads and guidelines
First script: read the passwd file (':'-separated values) and print the id (field 0 of each line).
From bash, run pig to enter Grunt (Pig's interactive shell).
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
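For readers who want to check the logic without a Pig install, the two statements above can be sketched in plain Python; the sample lines below are illustrative, not taken from a real passwd file:

```python
# Sketch of what the Pig script does: split each ':'-separated
# line and keep field 0 (the user id). Sample data is made up.
sample_lines = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
]

# load 'passwd' using PigStorage(':')  -> split each line on ':'
rows = [line.split(":") for line in sample_lines]

# foreach A generate $0 as id          -> take field 0 of each row
ids = [row[0] for row in rows]

print(ids)  # -> ['root', 'daemon']
```

dump B prints the same projection, one tuple per input line.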
The script can be placed on the local file system, AWS S3, or HDFS and run from there.
Local script file:
pig -f /tmp/file_reader.pig
Script file on HDFS:
pig hdfs://localhost:9000/user/root/file_reader.pig
Running Pig away from Hadoop
To connect to a remote cluster, specify the cluster parameters in the pig.properties file. Example for MRv1:
fs.default.name=hdfs://namenode_address:8020/
mapred.job.tracker=jobtracker_address:8021
You can also specify the remote cluster addresses on the command line:
pig -fs namenode_address:8020 -jt jobtracker_address:8021