HDFS:
HDFS Architecture:
Map Reduce:
Name Node:
Data Node:
Secondary Name Node:
Job Tracker:
Task Tracker:
Record Reader:
Ex. <Key, Value>
<1, What do you mean by object>
<2, What do you know about Java>
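A minimal sketch of how these pairs reach user code (note that with the default TextInputFormat the RecordReader actually supplies the byte offset of each line as the key rather than a line number; the class name EchoMapper is only for illustration):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Echoes the <key, value> pairs delivered by the RecordReader, unchanged.
public class EchoMapper extends Mapper<LongWritable, Text, LongWritable, Text>
{
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        // key:   position of the line within the input split (supplied by the RecordReader)
        // value: the line itself, e.g. "What do you mean by object"
        context.write(key, value);
    }
}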
Record Writer:
Ex. What 2
do 2
you 2
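Each pair a reducer emits with context.write(key, value) is handed to the RecordWriter of the job's OutputFormat; with the default TextOutputFormat it is written to the reducer's output file (e.g. part-r-00000) as the key, a tab, and the value, one pair per line, which is exactly the layout shown above.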
Combiner:
How does the Combiner work?
Although the Combiner is optional, it groups the intermediate data by key on the map side before the Reduce phase, which makes the data easier to process.
Map output (record 1): <What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
Map output (record 2): <What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
Combiner output: <What,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1> <know,1> <about,1> <Java,1>
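The complete WordCount program below performs exactly this combining step by reusing its reducer class, IntSumReducer, as the combiner (see job.setCombinerClass below).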
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable>
{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
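Once compiled and packaged into a jar, the job is typically submitted with a command along the lines of hadoop jar wc.jar WordCount <input path> <output path> (the jar name here is just a placeholder); the two paths become args[0] and args[1] in main().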
Partitioner:
package partitionerexample;
import java.io.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
public class PartitionerExample extends Configured implements Tool
{
//Map class
public static class MapClass extends Mapper<LongWritable,Text,Text,Text>
{
public void map(LongWritable key, Text value, Context context)
{
try{
String[] str = value.toString().split("\t", -3);
String gender=str[3];
context.write(new Text(gender), new Text(value));
}
catch(Exception e)
{
System.out.println(e.getMessage());
}
}
}
//Reducer class
public static class ReduceClass extends Reducer<Text,Text,Text,IntWritable>
{
public int max = -1;
public void reduce(Text key, Iterable <Text> values, Context context) throws IOException, InterruptedException
{
max = -1;
for (Text val : values)
{
String [] str = val.toString().split("\t", -3);
if(Integer.parseInt(str[4])>max)
max=Integer.parseInt(str[4]);
}
context.write(new Text(key), new IntWritable(max));
}
}
//Partitioner class
public static class CaderPartitioner extends Partitioner<Text, Text>
{
@Override
public int getPartition(Text key, Text value, int numReduceTasks)
{
String[] str = value.toString().split("\t");
int age = Integer.parseInt(str[2]);
if(numReduceTasks == 0)
{
return 0;
}
if(age<=20)
{
return 0;
}
else if(age>20 && age<=30)
{
return 1 % numReduceTasks;
}
else
{
return 2 % numReduceTasks;
}
}
}
@Override
public int run(String[] arg) throws Exception
{
Configuration conf = getConf();
Job job = Job.getInstance(conf, "topsal");
job.setJarByClass(PartitionerExample.class);
FileInputFormat.setInputPaths(job, new Path(arg[0]));
FileOutputFormat.setOutputPath(job,new Path(arg[1]));
job.setMapperClass(MapClass.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
//set partitioner statement
job.setPartitionerClass(CaderPartitioner.class);
job.setReducerClass(ReduceClass.class);
job.setNumReduceTasks(3);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String ar[]) throws Exception
{
int res = ToolRunner.run(new Configuration(), new PartitionerExample(), ar);
System.exit(res);
}
}
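The job above assumes tab-separated input records in which (0-based) field 2 is the age, field 3 the gender, and field 4 the salary; the earlier fields are not examined by this job. CaderPartitioner routes each record to one of the three reducers by age group (<=20, 21-30, over 30), and ReduceClass then emits the maximum salary per gender within each partition.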
UNIT - IV
Hadoop I/O
Why does Hadoop use Writable(s)?
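In short: Writable is Hadoop's own serialization format. It is compact and fast to serialize and deserialize, which matters because every key and value is serialized when it is shuffled between nodes and when it is written to or read from disk.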
Writable Interface
The Writable interface defines two methods: one for writing an object's state to a DataOutput binary stream, and one for reading its state back from a DataInput binary stream.
package org.apache.hadoop.io;
import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;
public interface Writable
{
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}
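A small helper pair (a sketch, not part of the Hadoop API) shows the two methods in use: write() pushes an object's state onto a binary stream, and readFields() repopulates an object from one.
// Sketch: serialize any Writable to a byte array using its write() method.
// (Assumes imports from java.io and org.apache.hadoop.io.Writable.)
public static byte[] serialize(Writable writable) throws IOException
{
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(out);
    writable.write(dataOut);       // the Writable writes its own fields
    dataOut.close();
    return out.toByteArray();
}
// Sketch: repopulate a Writable from a byte array using readFields().
public static void deserialize(Writable writable, byte[] bytes) throws IOException
{
    ByteArrayInputStream in = new ByteArrayInputStream(bytes);
    DataInputStream dataIn = new DataInputStream(in);
    writable.readFields(dataIn);   // the Writable reads its fields back
    dataIn.close();
}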
WritableComparable and comparators
package org.apache.hadoop.io;
public interface WritableComparable<T> extends Writable, Comparable<T>
{
}
package org.apache.hadoop.io;
import java.util.Comparator;
public interface RawComparator<T> extends Comparator<T>
{
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
RawComparator<IntWritable> c = WritableComparator.get(IntWritable.class);
IntWritable w1 =new IntWritable(164);
IntWritable w2 =new IntWritable(67);
assertThat(c.compare(w1,w2), greaterThan(0));
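The same comparator can also compare the serialized forms directly, which is the point of RawComparator (using the serialize() helper sketched in the Writable Interface section):
byte[] b1 = serialize(w1);
byte[] b2 = serialize(w2);
assertThat(c.compare(b1, 0, b1.length, b2, 0, b2.length), greaterThan(0));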
Writable classes
Writable wrappers for java primitives
IntWritable iw= new IntWritable();
iw.set(154);
IntWritable iw = new IntWritable(154);
How do you choose between a fixed length and a variable length encoding?
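Broadly: a fixed-length encoding (IntWritable, LongWritable) suits values distributed fairly uniformly over the whole value space, such as hashes, while a variable-length encoding (VIntWritable, VLongWritable) usually saves space because most values are small and are stored in fewer bytes. A small sketch, reusing the serialize() helper from above:
byte[] fixed = serialize(new IntWritable(163));      // always 4 bytes
byte[] variable = serialize(new VIntWritable(163));  // only 2 bytes for this value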
Text
Text t=new Text("hadoop");
t.set("pig");
t.toString();
BytesWritable
BytesWritable b = new BytesWritable(new byte[] {3, 5});
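Its serialized form is a 4-byte integer length followed by the bytes themselves, so the two-byte value above occupies six bytes in total (00 00 00 02 03 05). A sketch, reusing the serialize() helper from earlier:
byte[] bytes = serialize(b); // 6 bytes: 4-byte length prefix (2) followed by 03 05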
NullWritable
ObjectWritable and GenericWritable
Writable collections
ArrayWritable aw = new ArrayWritable(Text.class);
public class TextArrayWritable extends ArrayWritable
{
public TextArrayWritable()
{
super(Text.class);
}
}
MapWritable src = new MapWritable();
src.put(new IntWritable(1), new Text("cat"));
src.put(new VIntWritable(2), new LongWritable(163));
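Because MapWritable implements java.util.Map<Writable, Writable>, entries can be read back with get(), casting to the expected type; a minimal sketch:
Text first = (Text) src.get(new IntWritable(1));                   // "cat"
LongWritable second = (LongWritable) src.get(new VIntWritable(2)); // 163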
Implementing a custom Writable
Prog: The Writable implementation that stores a pair of Text objects
import java.io.*;
import org.apache.hadoop.io.*;
public class TextPair implements WritableComparable<TextPair>
{
private Text first;
private Text second;
public TextPair()
{
set(new Text(), new Text());
}
public TextPair(String first, String second)
{
set(new Text(first), new Text(second));
}
public TextPair(Text first, Text second)
{
set(first, second);
}
public void set(Text first, Text second)
{
this.first = first;
this.second = second;
}
public Text getFirst()
{
return first;
}
public Text getSecond()
{
return second;
}
@Override
public void write(DataOutput out) throws IOException
{
first.write(out);
second.write(out);
}
@Override
public void readFields(DataInput in) throws IOException
{
first.readFields(in);
second.readFields(in);
}
@Override
public int hashCode()
{
return first.hashCode() * 163 + second.hashCode();
}
@Override
public boolean equals(Object o)
{
if (o instanceof TextPair)
{
TextPair tp = (TextPair) o;
return first.equals(tp.first) && second.equals(tp.second);
}
return false;
}
@Override
public String toString()
{
return first + "\t" + second;
}
@Override
public int compareTo(TextPair tp)
{
int cmp = first.compareTo(tp.first);
if (cmp != 0)
{
return cmp;
}
return second.compareTo(tp.second);
}
}
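A quick round trip, reusing the serialize()/deserialize() sketches from the Writable Interface section, illustrates the contract that readFields() must be able to read back whatever write() produced:
TextPair original = new TextPair("hadoop", "pig");
byte[] bytes = serialize(original);
TextPair copy = new TextPair();
deserialize(copy, bytes);
// copy.equals(original) now holds, and TextPair can serve as a MapReduce key
// because it is a WritableComparable.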
Implementing a RawComparator for speed
Prog: A RawComparator for comparing TextPair byte representations
public static class Comparator extends WritableComparator
{
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public Comparator()
{
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)
{
try {
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
if (cmp != 0)
{
return cmp;
}
return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1, b2, s2 + firstL2, l2 - firstL2);
}
catch (IOException e)
{
throw new IllegalArgumentException(e);
}
}
}
static
{
WritableComparator.define(TextPair.class, new Comparator());
}
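Registering the comparator in a static block means that WritableComparator.get(TextPair.class), and hence the MapReduce framework when it sorts TextPair keys, picks up this raw byte-level comparator automatically instead of deserializing objects just to compare them.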
Custom comparators
Prog: A custom RawComparator for comparing the first field of TextPair byte representations
public static class FirstComparator extends WritableComparator
{
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public FirstComparator()
{
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)
{
try
{
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
}
catch (IOException e)
{
throw new IllegalArgumentException(e);
}
}
@Override
public int compare(WritableComparable a, WritableComparable b)
{
if (a instanceof TextPair && b instanceof TextPair)
{
return ((TextPair) a).first.compareTo(((TextPair) b).first);
}
return super.compare(a, b);
}
}
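A first-field comparator like this is typically plugged in for a secondary-sort pattern, for example as the grouping comparator so that all values sharing the same first field reach a single reduce() call. A sketch, assuming FirstComparator is declared as a nested class of TextPair:
job.setGroupingComparatorClass(TextPair.FirstComparator.class); // group reduce input by the first field only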