Flume NG
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows, and it uses a simple, extensible data model that allows for online analytic applications. Cloudera Flume NG v1.3.0 is configured with the appropriate source, channel, sink, etc.; once the configuration is done, the agent can be executed from the command shell. The 'Source' denotes the designated folder from which Flume reads event data and then stores the data in the designated 'Channel'. Multiple 'Sinks' can be mapped to each channel to transfer the data from the channel to HDFS. Once the ingestion of a file is completed, the input file is tagged with a ".COMPLETED" suffix and can then be deleted to make space for new files. The Flume agent can also be configured to read data from a web service port; in this case, the 'Source' continuously listens on a defined port, reads events, and stores them in the designated channel, while an HDFS sink associated with each channel transfers the events to HDFS. Flume agents reside at the staging area and start ingesting data into HDFS as soon as a data file reaches the staging area.
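A minimal sketch of such an agent configuration, assuming a hypothetical agent name (agent1), staging folder, and HDFS landing path:

    # Hypothetical agent 'agent1': one spooling-directory source, one file channel, one HDFS sink
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    # Source: read event files dropped into the staging folder; finished files are renamed *.COMPLETED
    agent1.sources.src1.type     = spooldir
    agent1.sources.src1.spoolDir = /data/staging
    agent1.sources.src1.channels = ch1

    # Channel: buffer events durably on local disk between source and sink
    agent1.channels.ch1.type = file

    # Sink: drain events from the channel into HDFS
    agent1.sinks.sink1.type          = hdfs
    agent1.sinks.sink1.channel       = ch1
    agent1.sinks.sink1.hdfs.path     = hdfs://namenode:8020/data/landing
    agent1.sinks.sink1.hdfs.fileType = DataStream

The agent would then be started from the command shell, for example: flume-ng agent --conf conf --conf-file agent1.properties --name agent1. For the web-service style ingestion described above, the source type would be changed to an HTTP or netcat style source bound to the listening port instead of a spooling directory.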
DistCp
Hadoop DistCp is a utility for copying large amounts of data between two Hadoop clusters using MapReduce.
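As a hedged illustration, assuming hypothetical namenode addresses and directory paths, an inter-cluster copy looks like:

    # Copy a directory tree from the source cluster to the target cluster as a MapReduce job
    hadoop distcp hdfs://source-nn:8020/data/logs hdfs://target-nn:8020/data/logs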
S3DistCp
S3DistCp has been used to copy data from an S3 bucket to HDFS in two ways (example commands follow the list):
To copy data from S3 to HDFS directly using the s3distcp utility.
To copy the data from the S3 bucket to EMR staging first; once the data has been moved to HDFS on EMR, S3DistCp is used to transfer the data to HDFS on EC2.
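A hedged sketch of both options, assuming a hypothetical bucket name and HDFS paths. s3-dist-cp is the command-line form of the S3DistCp utility shipped with EMR; for the final HDFS-to-HDFS leg the sketch substitutes plain distcp, which is interchangeable for that step:

    # Option 1: direct copy from S3 into HDFS using s3-dist-cp (run on the cluster)
    s3-dist-cp --src s3://example-bucket/raw/ --dest hdfs:///data/raw/

    # Option 2: stage the data on the EMR cluster's HDFS first, then push it to HDFS on EC2
    s3-dist-cp --src s3://example-bucket/raw/ --dest hdfs:///staging/raw/
    hadoop distcp hdfs://emr-nn:8020/staging/raw/ hdfs://ec2-nn:8020/data/raw/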
Sqoop
Sqoop is used to transfer data from any relational data storage system (DSS) to HDFS. Sqoop is also used to export data from HDFS back into a relational DSS for any analytics tool that requires data to be available in an RDBMS.
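A hedged sketch of both directions, assuming a hypothetical MySQL connection string, user, and table names:

    # Import a table from the relational database into HDFS (connection details are illustrative)
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table transactions \
      --target-dir /data/sales/transactions

    # Export processed results from HDFS back into the RDBMS for reporting tools
    sqoop export \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table transaction_summary \
      --export-dir /data/sales/summary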
Pig
Pig is a tool provided for data analytics and ad hoc reporting in Hadoop. Pig scripts are written in a scripting language called Pig Latin. Although Pig is most commonly used for ad hoc data processing, it can also be used efficiently for ETL processes and to move data from local systems into HDFS using MapReduce's parallel processing.
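As a hedged illustration of Pig used for a simple ETL-style movement of data into HDFS, assuming hypothetical paths and a made-up log schema:

    -- Load raw delimited log data (path and schema are illustrative)
    raw_logs = LOAD '/staging/logs' USING PigStorage('\t')
               AS (ts:chararray, level:chararray, message:chararray);

    -- Keep only error records as a simple ETL step
    errors = FILTER raw_logs BY level == 'ERROR';

    -- Write the cleaned data into an HDFS directory
    STORE errors INTO '/data/logs/errors' USING PigStorage('\t');

The script can be run with pig -x mapreduce script.pig, which compiles these steps into MapReduce jobs that execute in parallel.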
Distcp
Distcp is a Hadoop tool that provides parallel data transfer from one HDFS cluster to another HDFS cluster. For AWS based systems, distcp also provides a mechanism to transfer data from S3 into HDFS, which makes the data ingestion process simpler and cost effective for Hadoop solutions. Distcp is a map based transfer process which is computationally reliable and less processor hungry than its counterpart S3Distcp.
S3Distcp
S3Distcp is an Amazon AWS specific Hadoop tool that provides a better ingestion process for moving data from an S3 location into the Hadoop filesystem. This process is reducer heavy and hence requires a better system configuration; having said that, S3Distcp is faster than distcp.
Discussion: When to use which feature
One of the most important factors to consider before choosing any ingestion mechanism in a Big Data solution is our own hardware configuration, so that we can identify the most cost effective option that solves the problem. As far as the ingestion process goes, the most important factor is not the processing capability of the CPU and RAM but the network speed. In today's Big Data solution offerings the infrastructure is provisioned either on a cloud offering such as AWS or in an in-house datacenter. With an in-house datacenter we can choose the network bandwidth and transfer rates before setting up a Hadoop cluster, but with a cloud based offering we need to estimate the volume of data we expect to process and provision instances that provide the required network performance. The table below gives guidelines on the minimum data size at which shipping a physical drive becomes preferable to transferring the data over the available internet bandwidth; for example, at T1 speed (1.544 Mbps) and roughly 80% sustained utilization, transferring 1 TB would take on the order of 80 days.

Available Internet Bandwidth    Minimum data size to consider mailing a drive
T1 (1.544Mbps)                  100GB
10Mbps                          600GB
T3 (44.736Mbps)                 2TB
100Mbps                         5TB
1000Mbps                        60TB
Transferring data over a physical drive
In this approach, the data flow is described below:
Clients will provide the data drive to the data discovery team.
Data discovery team will mail the drive to the hosting provider.
The data discovery team will notify the clients upon successful receipt of the data.
Data discovery team ingests the data into Hadoop, keeping two copies of the data.
A master copy, which will serve as a backup.
A working copy for the data discovery process.
Transferring data using S3
In this approach, the data flow is as described below (for more details see TA 237):
1.1 The client uploads data to S3 over HTTPS.
1.2 Once the entire data set has been uploaded to S3, it is moved to the HDFS system on EC2/EMR using the s3distcp utility.
However, data from the S3 bucket can also be transferred directly to HDFS:
2.1 Data from the S3 bucket can be transferred directly to the HDFS master area using the s3distcp utility.
The following architecture is a simple and generic data ingestion and processing blueprint for any Big Data solution. Here data is stored in S3, which provides a reliable and cost effective storage option. Whether or not the Big Data solution runs on AWS, S3 buckets can always serve as a standard data storage centre. Once the data is in S3, the solution needs the data to be moved into the HDFS system for processing and running analytics. This data movement process, known as ingestion, can be done using Flume, S3distcp, Sqoop, or Pig, as described above.
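A minimal sketch of the two-step flow (1.1 and 1.2), assuming the AWS CLI is available on the client side and using a hypothetical bucket name and paths:

    # Step 1.1: client uploads the data set to S3 (the AWS CLI transfers over HTTPS by default)
    aws s3 cp /local/dataset s3://example-bucket/incoming/ --recursive

    # Step 1.2: on the EMR/EC2 cluster, move the uploaded data from S3 into HDFS
    s3-dist-cp --src s3://example-bucket/incoming/ --dest hdfs:///data/incoming/

For the direct path (2.1), the s3-dist-cp step alone is sufficient, pointed at the HDFS master area instead of an intermediate directory.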