Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
Create a directory, say hadoop-streaming, both in HDFS and under hadoop-projects in your home directory:
hadoop fs -mkdir /user/<user>/hadoop-streaming
mkdir -p /home/<user>/hadoop-projects/hadoop-streaming
Copy the wordcount directory, which contains the Python mapper and reducer scripts (mapper.py and reducer.py), into your local hadoop-streaming directory, then cd into it:
cp -r /home/sxg125/hadoop-projects/hadoop-streaming/wordcount /home/<user>/hadoop-projects/hadoop-streaming
cd /home/<user>/hadoop-projects/hadoop-streaming/wordcount
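The contents of mapper.py and reducer.py are not reproduced here. As a rough idea of what a word-count pair for Hadoop Streaming typically looks like, here is a minimal sketch. It is not the actual code in the wordcount directory: the function names map_lines and reduce_lines, the tab-separated "word<TAB>count" record format, and the single-file map/reduce switch are all assumptions made for illustration.

```python
#!/usr/bin/env python
"""Hypothetical word-count mapper and reducer for Hadoop Streaming.

A sketch only; the real mapper.py and reducer.py may differ.
"""
import sys
from itertools import groupby


def map_lines(lines):
    """Yield one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_lines(lines):
    """Sum the counts for each word.

    Assumes the records arrive sorted by key, which is how Hadoop
    Streaming delivers mapper output to a reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: "script.py map" or "script.py reduce",
    # reading records from stdin and writing them to stdout.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Streaming only requires that the mapper and reducer read lines from standard input and write lines to standard output, so scripts like these can be tested locally by piping a text file through the map step, then sort, then the reduce step.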
From the wordcount directory, run the following command. Note that it uses the same input files (file01 and file02) as the example above, located at /user/<user>/wordcount/input:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -file mapper.py -file reducer.py -mapper mapper.py -reducer reducer.py -input /user/<user>/wordcount/input/* -output /user/<user>/hadoop-streaming/output-wc
Validate the output:
hadoop fs -cat /user/<user>/hadoop-streaming/output-wc/part*
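One way to validate the result is to compare it with counts computed directly, without Hadoop. The sketch below assumes the job prints tab-separated "word<TAB>count" lines, which depends on how reducer.py formats its output. The helper names and the sample data are hypothetical; substitute the real contents of file01 and file02 and the text printed by the command above.

```python
from collections import Counter


def expected_counts(lines):
    """Count whitespace-separated words directly, without Hadoop."""
    return Counter(word for line in lines for word in line.split())


def parse_job_output(text):
    """Parse 'word<TAB>count' lines (assumed reducer output format)."""
    result = {}
    for line in text.splitlines():
        if line.strip():
            word, count = line.split("\t", 1)
            result[word] = int(count)
    return result


# Hypothetical sample data, not the actual input files or job output.
sample_input = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
sample_output = "Bye\t1\nGoodbye\t1\nHadoop\t2\nHello\t2\nWorld\t2\n"

# The two views of the data should agree if the job ran correctly.
assert parse_job_output(sample_output) == dict(expected_counts(sample_input))
```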
Create a separate HDFS input directory, bash-input, for the bash example:
hadoop fs -mkdir /user/<user>/hadoop-streaming/bash-input
Copy the bash-streaming directory, which contains large text files, into your local hadoop-streaming directory, then cd into it:
cp -r /home/sxg125/hadoop-projects/hadoop-streaming/bash-streaming /home/<user>/hadoop-projects/hadoop-streaming
cd /home/<user>/hadoop-projects/hadoop-streaming/bash-streaming
Copy the input files (pg*) to your HDFS input directory, /user/<user>/hadoop-streaming/bash-input:
hadoop fs -put pg* /user/<user>/hadoop-streaming/bash-input
Run the job, using /bin/cat as the mapper and /usr/bin/wc -l as the reducer:
hadoop jar /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.0.jar -mapper '/bin/cat' -reducer '/usr/bin/wc -l' -input /user/<user>/hadoop-streaming/bash-input/* -output /user/<user>/hadoop-streaming/bash-output
Check the output:
hadoop fs -cat /user/<user>/hadoop-streaming/bash-output/part*
Output (one line count per reduce task):
3767
20676
3843
3823
3835
3861
3795
3812
3737
3786
3861
3807
3881
3859
3832
3758
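The job prints 16 numbers rather than one because each reduce task runs wc -l on its own partition of the mapper output, so the total line count is their sum. The sketch below adds up the listed values and models the job in miniature. It is an illustration under assumptions: zlib.crc32 stands in for the hash used by Hadoop's partitioner, and simulate_wc is a hypothetical helper, not part of Hadoop.

```python
import zlib

# Per-reducer counts copied from the output listed above.
counts = [3767, 20676, 3843, 3823, 3835, 3861, 3795, 3812,
          3737, 3786, 3861, 3807, 3881, 3859, 3832, 3758]


def total_lines(per_reducer_counts):
    """The overall line count is the sum of the per-reducer counts."""
    return sum(per_reducer_counts)


def simulate_wc(lines, num_reducers):
    """Toy model: identity mapper, hash partitioning, `wc -l` per reducer."""
    per_reducer = [0] * num_reducers
    for line in lines:
        # Streaming treats the text before the first tab as the key.
        key = line.split("\t", 1)[0]
        # crc32 is a stand-in hash, NOT the function Hadoop actually uses.
        part = (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers
        per_reducer[part] += 1
    return per_reducer


print(total_lines(counts))  # 77933
```

In this model, records with identical keys always land on the same reducer. That is one plausible reason for the single large value (20676) in the output: repeated identical lines in the input, such as blank lines, would all be routed to one reduce task.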