Hadoop Components

Introduction

Hadoop has been introduced in numerous articles and blogs, but the following description from [1] defines it particularly well:

"Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.

How large an amount of work? Orders of magnitude larger than many existing systems work with. Hundreds of gigabytes of data constitute the low end of Hadoop-scale. Actually Hadoop is built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes. At this scale, it is likely that the input data set will not even fit on a single computer's hard drive, much less in memory. So Hadoop includes a distributed file system which breaks up input data and sends fractions of the original data to several machines in your cluster to hold. This results in the problem being processed in parallel using all of the machines in the cluster and computes output results as efficiently as possible."

Hadoop works on the fundamentals of distributed storage and distributed computation. In future articles, we will see how large files are broken into smaller chunks and distributed to different machines in the cluster, and how parallel processing works using Hadoop. Let's get started with Hadoop components.

Hadoop Components

Machines in a Hadoop cluster fall into three categories - Client machines, Master servers and Slave servers. The actual work is done by a set of daemons (programs) that run on these servers:

  • NameNode
  • DataNode
  • Secondary NameNode
  • JobTracker
  • TaskTracker

Clients

Client machines have Hadoop installed with all the cluster settings, but are neither Master nor Slave. Clients load data into the cluster, submit MapReduce jobs and retrieve or view the results of those jobs when they finish.
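
As an illustration, a client typically loads data through the HDFS Java API. Here is a minimal sketch, assuming the cluster address is already present in the client's configuration files; the file paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadFile {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings (NameNode address etc.) from the
        // client's Hadoop configuration files.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into the cluster; both paths are placeholders.
        fs.copyFromLocalFile(new Path("/tmp/input.log"),
                             new Path("/user/hadoop/input/input.log"));
        fs.close();
    }
}
```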

NameNode and DataNodes

Hadoop comes with its own filesystem, and since it manages the storage of files across several machines it is called the Hadoop Distributed File System (HDFS). In HDFS, large files are broken down into smaller blocks (64 MB by default) which are stored as independent units. The block size is configurable, but most installations keep the default (or raise it to 128 MB) so that seek time stays a small fraction of the time spent transferring each block.

To guard against block corruption and disk or machine failure, each block is replicated to physically separate machines. The default replication factor is 3, but this is also configurable. If a machine fails, the blocks it held are re-replicated from the remaining copies to other machines.
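
Both settings can be overridden. A minimal sketch using the Hadoop 1.x property names (dfs.block.size and dfs.replication); in practice these values normally live in hdfs-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // 128 MB blocks instead of the 64 MB default (Hadoop 1.x property name).
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);

        // Keep three copies of every block (this is already the default).
        conf.setInt("dfs.replication", 3);
    }
}
```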

An HDFS cluster has two types of nodes operating in a master-slave relationship: the NameNode (master) and DataNodes (slaves). The NameNode maintains the filesystem metadata of HDFS; it keeps track of how every file is broken down into blocks and which DataNodes store those blocks. All transaction logs are stored in a file called the EditLog, kept on the NameNode's local disk. The entire filesystem namespace, including the mapping of blocks to files and filesystem properties, is stored in a file called the FsImage, also kept on the local disk.

The DataNodes are the real workhorses of HDFS; they store and retrieve blocks on instruction from Clients or the NameNode. A DataNode may also communicate with other DataNodes to replicate blocks for redundancy. On startup, each DataNode reports to the NameNode the list of blocks it is storing. After this, the DataNodes continually check in with the NameNode, informing it of local changes and receiving instructions to create, move or delete blocks on local disk.

Secondary NameNode

There is only one NameNode in the cluster, and it is the most critical component of the Hadoop architecture: it maintains the HDFS namespace for the cluster, monitors the health of the DataNodes and coordinates access to data. Without the NameNode the filesystem cannot be used, and all the files would effectively be lost, as there is no way to reconstruct them from the blocks on the DataNodes alone. Hence, the NameNode should be made resilient to failure with good server hardware, redundant power supplies, NICs and so on. Beyond that, Hadoop provides two mechanisms to protect the metadata:

The first option is to back up the files that make up the persistent state of the filesystem metadata (the namespace image and the edit log). A common configuration is to write them to local disk and to a remote NFS mount.

The second option is to run a Secondary NameNode. Despite its name, the Secondary NameNode does not act as a standby NameNode; it communicates with the NameNode at a periodic, configurable interval to take a snapshot (checkpoint) of the HDFS metadata. This snapshot can be used to recover from a NameNode failure. However, the state held by the Secondary NameNode always lags the NameNode, so some data loss is almost certain.
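
Both mechanisms are driven by configuration. A minimal sketch using the Hadoop 1.x property names (dfs.name.dir for redundant metadata directories, fs.checkpoint.period for the checkpoint interval); the directory paths are placeholders, and these values normally live in hdfs-site.xml and core-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;

public class NameNodeResilience {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Option 1: write the namespace image and edit log to a local disk
        // and a remote NFS mount (comma-separated list, placeholder paths).
        conf.set("dfs.name.dir", "/data/1/dfs/nn,/mnt/nfs/dfs/nn");

        // Option 2: have the Secondary NameNode checkpoint every hour
        // (value in seconds).
        conf.setLong("fs.checkpoint.period", 3600);
    }
}
```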

Please note - HDFS Federation and HDFS High-Availability (HA) using ZooKeeper will be discussed later or in another article.

JobTracker and TaskTrackers

There are two types of nodes that control the job execution process: a JobTracker and a number of TaskTrackers. The Client submits a MapReduce job to the JobTracker to process a particular file. By consulting the NameNode, the JobTracker determines which DataNodes store the blocks for that file, assigns tasks to TaskTrackers based on that information, and monitors the status of each task.
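
For reference, a client hands a job to the JobTracker roughly like this using the classic org.apache.hadoop.mapred API. This is only a sketch: the identity mapper and reducer simply pass records through, and the HDFS paths are placeholders.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJob.class);
        conf.setJobName("pass-through");

        // Identity mapper/reducer just copy input to output; a real job
        // would plug in its own map and reduce classes here.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // Placeholder HDFS paths.
        FileInputFormat.setInputPaths(conf, new Path("/user/hadoop/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/hadoop/output"));

        // Submits the job to the JobTracker and waits for it to complete.
        JobClient.runJob(conf);
    }
}
```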

The TaskTrackers constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, that TaskTracker is assumed to have failed, and the JobTracker re-submits its tasks to other nodes in the cluster.

Data Flow

HDFS Write

When a Client is ready to load a file into the cluster, it breaks the file into blocks. The Client makes an RPC call to the NameNode and requests a list of DataNodes to host the replicas of the first block of the file. The NameNode assigns a unique block ID and returns the list of DataNodes. The Client organizes a pipeline from DataNode to DataNode and pushes the data into it. The first DataNode stores the data and forwards it to the second DataNode in the pipeline; the second stores it and forwards it to the third and last DataNode (assuming a replication factor of 3). Each write is acknowledged back through the pipeline.

When the first block is filled, the Client requests new DataNodes to host replicas of the next block. A new pipeline is organised and the Client pushes further data to it.
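
From the client's point of view the pipeline is invisible: application code simply writes to an output stream, and the HDFS client library handles block boundaries, DataNode selection and replication. A minimal sketch with a hypothetical path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode for a new file entry; as the stream
        // fills each block, the client library requests DataNodes and
        // pushes the data down the replication pipeline.
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/demo.txt"));
        out.writeBytes("hello hdfs\n");
        out.close();  // flushes the last block and finalizes the file
        fs.close();
    }
}
```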

The NameNode chooses DataNodes using the following default strategy: the first replica is placed on the same node as the client (for clients running outside the cluster, a random node that is not too full or too busy is chosen). The second replica is placed on a different rack from the first. The third replica is placed on the same rack as the second, but on a different node. How does the NameNode know which rack a DataNode resides on? See the Rack Awareness section next.

Rack Awareness

Hadoop is rack-aware, i.e. it knows the topology of the network, in order to get maximum performance. For a multi-rack cluster you need to map nodes to racks; this is done manually or with a script (which is straightforward provided IP addresses are assigned hierarchically). With this mapping in place, Hadoop prefers on-rack transfers, where more bandwidth is available, over off-rack transfers when scheduling MapReduce tasks, and HDFS can place replicas more intelligently as well.
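
The mapping itself is simply "node address in, rack name out". Below is a hypothetical sketch of the kind of logic a topology script or mapping plugin performs when IP addresses are assigned hierarchically (say, one subnet per rack); the addressing rule is an assumption for illustration, and real clusters wire this in through Hadoop's topology settings.

```java
public class RackMapper {
    // Hypothetical rule: the third octet of the IP identifies the rack,
    // e.g. 10.1.2.15 -> /rack2. "/default-rack" is the name Hadoop falls
    // back to when no mapping is configured.
    public static String rackFor(String ipAddress) {
        String[] octets = ipAddress.split("\\.");
        if (octets.length != 4) {
            return "/default-rack";
        }
        return "/rack" + octets[2];
    }

    public static void main(String[] args) {
        System.out.println(rackFor("10.1.2.15"));  // prints /rack2
        System.out.println(rackFor("10.1.3.40"));  // prints /rack3
    }
}
```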

HDFS Read

When a Client wishes to read a file, it makes an RPC call to the NameNode to determine the locations of the blocks of the file. For each block, the NameNode returns the addresses of all DataNodes that have a copy, sorted by network proximity to the Client. The Client connects to the closest DataNode for the first block; data is streamed from the DataNode to the Client, and when the end of the block is reached the connection is closed. The Client then connects to the closest DataNode for the second block, and so on.
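
Again, block lookup and DataNode selection happen inside the client library; application code just opens a stream and reads. A minimal sketch with a hypothetical path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for the block locations; the stream
        // then reads each block from the closest DataNode holding a copy.
        FSDataInputStream in = fs.open(new Path("/user/hadoop/demo.txt"));
        IOUtils.copyBytes(in, System.out, 4096, false);
        IOUtils.closeStream(in);
        fs.close();
    }
}
```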

Summary

In this article, we looked at the key daemons and their roles, and at how data flows through HDFS. The NameNode is the most important node in a Hadoop cluster, and rack awareness is a key concept that helps with both performance and fault tolerance.

References

[1] Yahoo! Developer Network

Further reading

Here's a great blog post by Brad Hedlund on Hadoop Overview