
Running a Wordcount Mapreduce example in Hadoop 2.4.1 Single-node Cluster in Ubuntu 14.04 (64-bit)

Step 1: 

Open the Eclipse IDE (download from http://www.eclipse.org/downloads/) and create a new project with 3 class files – WordCount.java, WordCountMapper.java and WordCountReducer.java.

Step 2: 

Open WordCount.java and paste the following code.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;


public class WordCount extends Configured implements Tool
{
      public int run(String[] args) throws Exception
      {
            // create a JobConf object and assign a job name for identification purposes
            JobConf conf = new JobConf(getConf(), WordCount.class);
            conf.setJobName("WordCount");

            // set the data types of the output key and value on the configuration object
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            // provide the mapper and reducer class names
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);

            // we will give 2 arguments at run time: the first is the input path, the second is the output path
            Path inp = new Path(args[0]);
            Path out = new Path(args[1]);

            // the HDFS input and output directories are taken from the command line
            FileInputFormat.addInputPath(conf, inp);
            FileOutputFormat.setOutputPath(conf, out);

            JobClient.runJob(conf);
            return 0;
      }

      public static void main(String[] args) throws Exception
      {
            // main simply calls the run method defined above through ToolRunner
            int res = ToolRunner.run(new Configuration(), new WordCount(), args);
            System.exit(res);
      }
}
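
Because the driver extends Configured and is launched through ToolRunner, generic Hadoop options can be passed on the command line before the program arguments and are picked up via getConf(). For example, once the jar has been built (Step 7), something like the following should run the job with two reducers (mapreduce.job.reduces is the Hadoop 2.x property name; the older mapred.reduce.tasks alias also works):

  hadoop jar wordcount.jar -D mapreduce.job.reduces=2 /usr/local/hadoop/input /usr/local/hadoop/output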

Step 3: 

 Open WordCountMapper.java and paste the following code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
      // Hadoop-supported data types
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // map method that performs the tokenizing job and frames the initial key-value pairs;
      // once all map output has been produced, the framework passes it on to the reducer
      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
            // take one line at a time from the input file and tokenize it
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);

            // iterate through all the words in that line and form the key-value pairs
            while (tokenizer.hasMoreTokens())
            {
                  word.set(tokenizer.nextToken());
                  // send to the output collector, which in turn passes the pair on to the reducer
                  output.collect(word, one);
            }
      }
}

Step 4: 

 Open WordCountReducer.java and paste the following code.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
      // the reduce method accepts the key-value pairs from the mappers, aggregates the values by key and produces the final output
      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
            int sum = 0;
            /* iterate through all the values available for a key, add them together and emit
               the key along with the sum of its values */
            while (values.hasNext())
            {
                  sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
      }
}

Step 5: 

You need to resolve the missing dependencies by adding jar files from the Hadoop installation folder. Click on the Project tab and go to Properties. Under the Libraries tab, click Add External JARs and select all the jars in the /usr/local/hadoop/share/hadoop/common and /usr/local/hadoop/share/hadoop/mapreduce folders (click on the first jar, hold Shift and click on the last jar to select all jars in between, then click OK).
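
These folders contain essentially the same jars that the hadoop command itself puts on its classpath; if you ever need the full list (for example, to compile outside Eclipse), you can print it with:

  hadoop classpath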
 

Step 6: 

Now click on the Run tab and select Run Configurations. Click the New Configuration button at the top left, fill in the following properties and click Apply (see image).

  • Name – Any name will do – Ex: WordCountConfig
  • Project – Browse and select your project
  • Main Class – Select WordCount – this is our main class

Step 7: 

 Now click on the File tab and select Export. Under Java, select Runnable JAR file (see image).

  • In Launch Config – select the configuration you created in Step 6 (WordCountConfig).
  • Select an export destination (let's say the Desktop).
  • Under Library handling, select Extract Required Libraries into generated JAR and click Finish.
  • Right-click the jar file, go to Properties and, under the Permissions tab, check Allow executing file as a program, and give read and write access to all users.
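
If you prefer to skip the Eclipse build entirely, a rough command-line equivalent of Steps 5–7 would look like the sketch below (it assumes the three .java files are in the current directory and that the hadoop command is on your PATH; the classes folder and jar name are just examples):

  mkdir -p classes
  javac -classpath "$(hadoop classpath)" -d classes WordCount.java WordCountMapper.java WordCountReducer.java
  jar cfe wordcount.jar WordCount -C classes .

Running jar tf wordcount.jar afterwards lists the contents, so you can check that the three WordCount classes ended up inside.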

Hadoop Part (NOTE: the user should be hduser in the terminal)

Step 8:  

  • Switch to hduser –   sudo su hduser
  • Change directory to Hadoop conf –   cd /usr/local/hadoop/etc/hadoop
  • Delete temporary folders –
    • sudo rm -R /app/*
    • sudo rm -R /tmp/*
  • Format the namenode (exit status in the last few lines in the terminal should be 0) –  hadoop namenode -format  
  • Start all daemons –  start-dfs.sh && start-yarn.sh
  • Check that the DataNode and NameNode are running (the minimum requirement for running a MapReduce job on a single node) using jps –  jps  (a sample of the expected output is shown after this list)
  • If the DataNode or NameNode is not starting, check whether other programs are running on the ports reserved for Hadoop.
    • sudo netstat -tulpn | grep :8040
    • sudo netstat -tulpn | grep :8042
    • sudo netstat -tulpn | grep :50070
    • sudo netstat -tulpn | grep :50075
    • If you see any program running, kill it using this command, where the last number (10234 in this case) is the PID (the process ID shown by the sudo netstat -tulpn | grep :port command)
      • sudo kill -9 10234
  • After you have the DataNode and NameNode running, proceed to Step 9.
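
If everything started cleanly, the jps output should look roughly like the following (a sketch; the PIDs will differ on your machine, and ResourceManager/NodeManager only appear once start-yarn.sh has run):

  12034 NameNode
  12187 DataNode
  12398 SecondaryNameNode
  12561 ResourceManager
  12702 NodeManager
  12850 Jps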

Step 9:  

  • Make an HDFS directory (Note: these directories are not listed when ls is used in the terminal, and they are also not visible in the File Browser) –  hadoop dfs -mkdir -p /usr/local/hadoop/input
  • Copy the sample input text file into this HDFS directory –   hadoop dfs -copyFromLocal /home/kishorer747/Desktop/sample.txt /usr/local/hadoop/input  (you can verify the copy with the ls command shown after this list)
  • Change to the directory containing the jar file and run the WordCount example. NOTE: don't create the output folder yourself; it will be created by the job, and every time you run the example you must give a new output directory. These directories are not visible with the ls command in the terminal.
  • cd /home/kishorer747/Desktop
  • hadoop jar wordcount.jar /usr/local/hadoop/input /usr/local/hadoop/output
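
Before and after running the job, it can help to sanity-check HDFS with standard listing commands (the paths are the ones used above):

  hdfs dfs -ls /usr/local/hadoop/input
  hdfs dfs -ls /usr/local/hadoop/output

The first should show sample.txt after the copy; the second should show the part file and a _SUCCESS marker once the job finishes.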

Step 10:

  • To view the results –  hdfs dfs -cat /usr/local/hadoop/output/part-00000  (this job uses the old mapred API, so the file is named part-00000; jobs written with the newer mapreduce API produce part-r-00000 instead). See the note after this list for copying the results to a local file.

  • To remove folders created in HDFS –  hdfs dfs -rm -R /usr/local/hadoop/output
  • To view folders in HDFS –  hdfs dfs -ls /usr/local/hadoop/
  • To stop all hadoop daemons, first change directory – cd /usr/local/hadoop/etc/hadoop
  • And then use –  stop-dfs.sh && stop-yarn.sh
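
If you want to keep a local copy of the results before removing the output folder, you can pull the part file out of HDFS (the destination file name here is just an example):

  hdfs dfs -get /usr/local/hadoop/output/part-00000 /home/kishorer747/Desktop/wordcount_results.txt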