
How to write output to multiple named files in Hadoop using MultipleTextOutputFormat

Sometimes we want our MapReduce job to write its output to named files.

For example, suppose you have an input file that contains the following data:
Name:Dash
Age:27
Name:Nish
Age:29
 
This can be done in Hadoop by using the MultipleTextOutputFormat class. The following is a simple example implementation that reads the file above and creates two output files, Name and Age.
The part where the action happens is the MultiFileOutput class near the bottom of the listing.
package org.myorg;

import java.io.*;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.mapred.lib.*;


public class mult {

        // Mapper: splits each input line of the form "Key:Value" (e.g. "Name:Dash")
        // into a Text key ("Name") and a Text value ("Dash").
        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
                public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
                        String[] parts = value.toString().split(":", 2);
                        if (parts.length == 2) {
                                output.collect(new Text(parts[0]), new Text(parts[1]));
                        }
                }
        }

        // Reducer: passes every value through unchanged, grouped by key.
        public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
                public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
                        while (values.hasNext()) {
                                output.collect(key, values.next());
                        }
                }
        }

        // The output format where the action happens: the string returned by
        // generateFileNameForKeyValue becomes the name of the file that the
        // key/value pair is written to.
        public static class MultiFileOutput extends MultipleTextOutputFormat<Text, Text> {
                @Override
                protected String generateFileNameForKeyValue(Text key, Text value, String name) {
                        return key.toString();   // one output file per key, e.g. "Name" and "Age"
                }
        }

        public static void main(String[] args) throws Exception {
                String inputFiles = args[0];
                String outputDir = args[1];

                Configuration mycon = new Configuration();
                JobConf conf = new JobConf(mycon, mult.class);
                conf.setJobName("mult");

                // Both the map and reduce phases emit Text keys and Text values.
                conf.setOutputKeyClass(Text.class);
                conf.setOutputValueClass(Text.class);

                conf.setMapperClass(Map.class);
                conf.setReducerClass(Reduce.class);

                conf.setInputFormat(TextInputFormat.class);
                conf.setOutputFormat(MultiFileOutput.class);   // use the custom output format

                FileInputFormat.setInputPaths(conf, inputFiles);
                FileOutputFormat.setOutputPath(conf, new Path(outputDir));

                JobClient.runJob(conf);
        }
}


The output directory will contain two files, Name and Age.
File Name contains:
Name    Nish
Name    Dash

File Age contains:
Age     27
Age     29
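
Note that the key is still written on every line of each output file (Name    Nish rather than just Nish). If you only want the values, MultipleTextOutputFormat also lets you override generateActualKey; returning null there suppresses the key. A minimal sketch of that variation (not part of the original example):

static class MultiFileOutput extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
                return key.toString();          // still one output file per key
        }

        @Override
        protected Text generateActualKey(Text key, Text value) {
                return null;                    // drop the key so only the value is written
        }
}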

The class MultiFileOutput extends MultipleTextOutputFormat. This means that when the reducer is ready to emit a key/value pair, the output format passes the pair to the method generateFileNameForKeyValue before writing it to a file. The logic that names the output file is embedded in this method (in this case the logic is to create one file per key), and the string returned by generateFileNameForKeyValue is the name of the file where that key/value pair is written.
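
The naming logic can be whatever you need. A common variation (again a sketch, not part of the original example) is to return the key plus the original leaf file name, so that each key gets its own subdirectory under the job's output directory while the reducer's default part-file name is preserved:

static class MultiFileOutput extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
                // "name" is the default leaf file name (e.g. part-00000), so this
                // produces output such as Name/part-00000 and Age/part-00000.
                return key.toString() + "/" + name;
        }
}

This layout is convenient when a later job should read everything written for one key: you can point it directly at that key's subdirectory. Also note that MultipleTextOutputFormat belongs to the older org.apache.hadoop.mapred API; on the newer org.apache.hadoop.mapreduce API the analogous facility is the MultipleOutputs class.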
 