Flume NG
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows, and it uses a simple, extensible data model that allows for online analytic applications. Cloudera Flume NG v1.3.0 is configured with the appropriate source, channel, sink, etc.; once the configuration is done, the agent can be executed from the command shell. The 'Source' denotes the designated folder from which Flume reads event data and then stores the data in the designated 'Channel'. Multiple 'Sinks' can be mapped to each channel to transfer the data from the channel to HDFS. Once the ingestion of a file is completed, the input file is tagged with a ".COMPLETED" suffix and can then be deleted to make space for new files. The Flume agent can also be configured to read data from a web service port; in this case, the 'Source' continuously listens on a defined port, reads events, and stores them in the designated channel, while an HDFS sink associated with each channel transfers the events to HDFS. Flume agents reside at the staging area and start ingesting data into HDFS as soon as a data file reaches the staging area.
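A minimal sketch of such an agent configuration, assuming a hypothetical agent name (agent1), staging folder, and HDFS landing path:

    # Hypothetical agent 'agent1': one spooling-directory source, one file channel, one HDFS sink
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    # Source: read event files dropped into the staging folder; finished files are renamed *.COMPLETED
    agent1.sources.src1.type     = spooldir
    agent1.sources.src1.spoolDir = /data/staging
    agent1.sources.src1.channels = ch1

    # Channel: buffer events durably on local disk between source and sink
    agent1.channels.ch1.type = file

    # Sink: drain events from the channel into HDFS
    agent1.sinks.sink1.type          = hdfs
    agent1.sinks.sink1.channel       = ch1
    agent1.sinks.sink1.hdfs.path     = hdfs://namenode:8020/data/landing
    agent1.sinks.sink1.hdfs.fileType = DataStream

The agent would then be started from the command shell, for example: flume-ng agent --conf conf --conf-file agent1.properties --name agent1. For the web-service style ingestion described above, the source type would be changed to an HTTP or netcat style source bound to the listening port instead of a spooling directory.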
DistCp
Hadoop DistCp is a utility for copying large amounts of data between two Hadoop clusters using MapReduce.
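As a hedged illustration, assuming hypothetical namenode addresses and directory paths, an inter-cluster copy looks like:

    # Copy a directory tree from the source cluster to the target cluster as a MapReduce job
    hadoop distcp hdfs://source-nn:8020/data/logs hdfs://target-nn:8020/data/logs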
S3DistCp
S3DistCp has been used to copy data from an S3 bucket to HDFS in two ways (example commands follow the list):
To copy data from S3 to HDFS directly using the s3distcp utility.
To copy the data from the S3 bucket to EMR staging first; once the data has been moved to HDFS on EMR, S3DistCp is used to transfer the data to HDFS on EC2.
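A hedged sketch of both options, assuming a hypothetical bucket name and HDFS paths. s3-dist-cp is the command-line form of the S3DistCp utility shipped with EMR; for the final HDFS-to-HDFS leg the sketch substitutes plain distcp, which is interchangeable for that step:

    # Option 1: direct copy from S3 into HDFS using s3-dist-cp (run on the cluster)
    s3-dist-cp --src s3://example-bucket/raw/ --dest hdfs:///data/raw/

    # Option 2: stage the data on the EMR cluster's HDFS first, then push it to HDFS on EC2
    s3-dist-cp --src s3://example-bucket/raw/ --dest hdfs:///staging/raw/
    hadoop distcp hdfs://emr-nn:8020/staging/raw/ hdfs://ec2-nn:8020/data/raw/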
Sqoop
Sqoop is used to transfer data from any relational data storage system (DSS) to HDFS. Sqoop is also used to export data from HDFS back into a relational DSS for any analytics tool that requires data to be available in an RDBMS.
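A hedged sketch of both directions, assuming a hypothetical MySQL connection string, user, and table names:

    # Import a table from the relational database into HDFS (connection details are illustrative)
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table transactions \
      --target-dir /data/sales/transactions

    # Export processed results from HDFS back into the RDBMS for reporting tools
    sqoop export \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table transaction_summary \
      --export-dir /data/sales/summary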
Pig
Pig is a tool provided for data analytics and ad hoc reporting in Hadoop. Pig scripts are written in a scripting language called Pig Latin. Although Pig is most commonly used for ad hoc data processing, it can also be used efficiently for ETL processes and to move data from local systems into HDFS using MapReduce's parallel processing.
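As a hedged illustration of Pig used for a simple ETL-style movement of data into HDFS, assuming hypothetical paths and a made-up log schema:

    -- Load raw delimited log data (path and schema are illustrative)
    raw_logs = LOAD '/staging/logs' USING PigStorage('\t')
               AS (ts:chararray, level:chararray, message:chararray);

    -- Keep only error records as a simple ETL step
    errors = FILTER raw_logs BY level == 'ERROR';

    -- Write the cleaned data into an HDFS directory
    STORE errors INTO '/data/logs/errors' USING PigStorage('\t');

The script can be run with pig -x mapreduce script.pig, which compiles these steps into MapReduce jobs that execute in parallel.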
Distcp
Distcp is a Hadoop tool that provides parallel data transfer from one HDFS cluster to another HDFS cluster. For AWS based systems, distcp also provides a mechanism to transfer data from S3 into HDFS, which makes the data ingestion process simpler and cost effective for Hadoop solutions. Distcp is a map based transfer process which is computationally reliable and less processor hungry than its counterpart S3Distcp.
S3Distcp
S3Distcp is an Amazon AWS specific Hadoop tool that provides a better ingestion process for moving data from an S3 location into the Hadoop filesystem. This process is reducer heavy and hence requires a better system configuration; having said that, S3Distcp is faster than distcp.
Discussion: When to use which feature
One of the most important factors to consider before choosing any ingestion mechanism in a Big Data solution is our own hardware configuration, so that we can identify the most cost effective option that solves the problem. As far as the ingestion process goes, the most important factor is not the processing capability of the CPU and RAM but the network speed. In today's Big Data solution offerings the infrastructure is provisioned either on a cloud offering such as AWS or in an in-house datacenter. With an in-house datacenter we can choose the network bandwidth and transfer rates before setting up a Hadoop cluster, but with a cloud based offering we need to estimate the volume of data we expect to process and provision instances that provide the required network performance. The table below gives guidelines on the minimum data size at which shipping a physical drive becomes preferable to transferring the data over the available internet bandwidth; for example, at T1 speed (1.544 Mbps) and roughly 80% sustained utilization, transferring 1 TB would take on the order of 80 days.

Available Internet Bandwidth    Minimum data size to consider mailing a drive
T1 (1.544Mbps)                  100GB
10Mbps                          600GB
T3 (44.736Mbps)                 2TB
100Mbps                         5TB
1000Mbps                        60TB
Transferring data over a physical drive
In this approach, the data flow is described below:
Clients will provide the data drive to the data discovery team.
Data discovery team will mail the drive to the hosting provider.
The data discovery team will notify the clients upon successful receipt of the data.
Data discovery team ingests the data into Hadoop, keeping two copies of the data.
A master copy, which will serve as a backup.
A working copy for the data discovery process.
Transferring data using S3
In this approach, the data flow is as described below (for more details see TA 237):
1.1 The client uploads data to S3 over HTTPS.
1.2 Once the entire data set has been uploaded to S3, it is moved to the HDFS system on EC2/EMR using the s3distcp utility.
However, data from the S3 bucket can also be transferred directly to HDFS:
2.1 Data from the S3 bucket can be transferred directly to the HDFS master area using the s3distcp utility.
The following architecture is a simple and generic data ingestion and processing blueprint for any Big Data solution. Here data is stored in S3, which provides a reliable and cost effective storage option. Whether or not the Big Data solution runs on AWS, S3 buckets can always serve as a standard data storage centre. Once the data is in S3, the solution needs the data to be moved into the HDFS system for processing and running analytics. This data movement process, known as ingestion, can be done using Flume, S3distcp, Sqoop, or Pig, as described above.
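A minimal sketch of the two-step flow (1.1 and 1.2), assuming the AWS CLI is available on the client side and using a hypothetical bucket name and paths:

    # Step 1.1: client uploads the data set to S3 (the AWS CLI transfers over HTTPS by default)
    aws s3 cp /local/dataset s3://example-bucket/incoming/ --recursive

    # Step 1.2: on the EMR/EC2 cluster, move the uploaded data from S3 into HDFS
    s3-dist-cp --src s3://example-bucket/incoming/ --dest hdfs:///data/incoming/

For the direct path (2.1), the s3-dist-cp step alone is sufficient, pointed at the HDFS master area instead of an intermediate directory.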