Hadoop supports programming in other languages (Python, R, etc.) via a generic API called Hadoop Streaming.
It is similar to the pipes concept in Unix.
Format is "Input | Mapper | Reducer"
Ex: echo "this sentence has five words" | cat | wc
Default record format: "key \t value \n" (key and value separated by a tab, records separated by newlines)
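Streaming passes records to the mapper on stdin and reads key/value pairs back on stdout, splitting each output line at the first tab. A minimal sketch of that parsing convention (the function name parse_kv is illustrative, not part of any Hadoop API):

```python
def parse_kv(line):
    # Streaming splits each mapper/reducer output line at the first tab:
    # everything before the tab is the key, the rest is the value.
    key, _, value = line.rstrip("\n").partition("\t")
    # If there is no tab, the whole line is the key and the value is empty.
    return key, value

print(parse_kv("hello\t1"))
```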
Running a normal hadoop job
hadoop jar wordcount.jar inputfile outputfile
Running a hadoop streaming job
hadoop jar hadoop-streaming.jar -input inputfile -output outputfile -file filename -mapper path -reducer path
(-file ships a local script to every node; -mapper and -reducer give the commands to run for each phase)
Ex: hadoop streaming job using Linux commands (this one counts lines rather than words)
hadoop jar hadoop-streaming.jar -input hello.txt -output /output -mapper cat -reducer "wc -l" # to count the number of lines in the input file
hadoop fs -ls -R /output
Two files will be listed, e.g. /output/_SUCCESS and /output/part-00000
hadoop fs -cat /output/part* # shows the content of the output file(s)
Ex: word count hadoop streaming job using python
Mapper.py - mapper code in Python
Reducer.py - reducer code in Python
hadoop jar hadoop-streaming.jar -input inputfile -output outputfile -file mappercodefile -mapper mappername -file reducercodefile -reducer reducername
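The word-count mapper and reducer can be sketched in Python as below. This is one common convention, not the only way to write them: the mapper emits "word \t 1" per word, and since streaming sorts mapper output by key before the reduce phase, the reducer can sum adjacent equal keys. Both functions are written over line iterables so the same logic can back Mapper.py and Reducer.py.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit "word \t 1" for every word on every input line.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Streaming delivers mapper output sorted by key, so equal words
    # are adjacent; group them and sum their counts.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

if __name__ == "__main__":
    # Run as "python thisfile.py map" or "python thisfile.py reduce";
    # both phases read stdin and write stdout, as streaming expects.
    stage = mapper if sys.argv[1:] == ["map"] else reducer
    for out in stage(sys.stdin):
        print(out)
```

Locally you can test the whole pipeline with plain pipes, mirroring the streaming job: cat hello.txt | python thisfile.py map | sort | python thisfile.py reduce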