
How to read Hadoop HDFS Sequence file using Hadoop streaming

It's very common to use the Unix "cat" command as the mapper in Hadoop streaming, but this does not work if the underlying file is a sequence file, which is a binary format rather than plain text. To get around this, run Hadoop streaming with the option:
 
-inputformat SequenceFileAsTextInputFormat
 
 
The following command counts the lines in a file that is stored as a sequence file in HDFS:
HADOOP=$HADOOP_HOME/bin/hadoop
$HADOOP jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-dev-streaming.jar \
  -input <input> \
  -output <output> \
  -mapper "/bin/cat" \
  -reducer "/bin/wc -l" \
  -inputformat SequenceFileAsTextInputFormat
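
With this input format, each record should reach the mapper as one line of text, with the key and the value converted to strings and separated by a tab. That is why "/bin/cat" plus "/bin/wc -l" is enough to count records in the command above. If a mapper only wants the values, it could drop the key first. The sketch below simulates that locally, without Hadoop; the function name strip_key and the sample records are made up for illustration, and the key<TAB>value layout is an assumption about how streaming presents the records.

```shell
#!/bin/sh
# Hypothetical stand-in for a streaming mapper: drop the key, keep the value.
# Assumes each input line has the form key<TAB>value.
strip_key() {
  # Keep everything after the first tab, so tabs inside the value survive.
  cut -f2-
}

# Simulated mapper input (made-up records), one key<TAB>value pair per line.
printf 'k1\tfirst value\nk2\tsecond value\n' | strip_key
```

A script like this could be passed to -mapper in place of "/bin/cat" (shipping it to the cluster with the -file option), though that has not been tested here.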

