This diagram depicts the Hadoop ecosystem holistically. In any Hadoop system, one of the major areas of concern is identifying the sources of data and determining how to ingest that data into the Hadoop system. Data of varied schemas, relationships, types and formats needs to be stored in the Hadoop system in order to derive meaningful insights. This inbound transfer of data from external systems into the Hadoop core forms a major part of designing a robust Hadoop architecture.
This document focuses on the different ingestion approaches, data types and processes used to bring external data into the Hadoop ecosystem.
The diagram also shows the Hadoop cluster and its interaction with the data storage system (DSS) layer and the various visualization and analytics tools used for reporting; these form an integral part of the data architecture but are considered out of scope for this document.
How polystructured data requires special handling compared to traditional data
Traditional systems handle data that has a schema and relationships attached to it very easily. Data storage systems such as an RDBMS handle the relationships between similar data adhering to a specific schema very efficiently.
Traditional systems also provide analytics and insights on this data when the volume is manageable. These systems are ideal for the transactional data mostly found in OLTP (Online Transaction Processing) systems.
However, as discussed in this document, it is not sufficient to handle only the data that is driven by the business. Most of the information generated today does not come only from core systems; it is also generated by the applications and tools that are part of the wider business process.
In fact, most of the information and business insight is locked in data generated by logs, call center chats, emails, SMS, voice communication and many other channels that form part of customers' interaction with interfaces other than the transactional systems.
The data generated by these diverse systems and services differs in nature and, most importantly, in type and format. It is nonetheless important to look at this data from a single point of view, because together it tells the whole story.
Because of the diverse forms of this data, and the need to analyze it together, traditional data storage systems cannot be used to store this schema-less, diverse data in one place and perform complex analytics on it.
Data that varies in schema and structure, but is bound together by a common use case and must be processed together to provide a complete picture, is called polystructured data.
Polystructured data can be as simple as chat transcripts which, when analyzed, reveal the sentiment of the people involved; it can be a company's email communication; it can be as complex as images and social media status information; or it can be a combination of all of these.
Each of these polystructured data sources needs to be treated separately, but ultimately needs to be saved in a common platform for processing.
Hadoop provides a platform where data generated in various ways and by various processes is saved together in such a way that processing and analytics remain easily manageable.
Each data source is different and has to be treated differently; most importantly, the data needs to be transformed and loaded into Hadoop in such a way that data coming from different sources can all be managed in Hadoop for processing. This process is called ETL (Extract, Transform and Load). The ETL process provides a way to convert the various ingested data into a common format that the Hadoop system can process.
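As a minimal illustration of this normalization step, the hypothetical Java sketch below converts records from two assumed source formats (a call-center chat log line and a transaction CSV line, both invented purely for illustration) into one common tab-separated layout before the data is loaded into Hadoop. The field names and formats are assumptions, not part of any specific tool.

    // Minimal ETL-style normalization sketch (illustrative only).
    // Two hypothetical source record formats are converted into one
    // common tab-separated layout before being loaded into Hadoop.
    import java.util.Arrays;
    import java.util.List;

    public class NormalizeRecords {

        // Assumed common format: source \t timestamp \t customerId \t payload
        static String fromCallCenterChat(String chatLine) {
            // Assumed chat log format: "timestamp|customerId|message"
            String[] parts = chatLine.split("\\|", 3);
            return String.join("\t", "chat", parts[0], parts[1], parts[2]);
        }

        static String fromTransactionCsv(String csvLine) {
            // Assumed transaction format: "timestamp,customerId,amount,branch"
            String[] parts = csvLine.split(",", 4);
            return String.join("\t", "txn", parts[0], parts[1],
                    parts[2] + ";" + parts[3]);
        }

        public static void main(String[] args) {
            List<String> normalized = Arrays.asList(
                    fromCallCenterChat("2016-01-05T10:12:00|C1001|No ATM near my office"),
                    fromTransactionCsv("2016-01-05T10:15:32,C1001,250.00,Downtown"));
            // In a real pipeline these lines would be written to HDFS,
            // for example through the FileSystem API or a Flume sink.
            normalized.forEach(System.out::println);
        }
    }

Once every source is reduced to such a common layout, downstream Hadoop jobs can process records from all channels uniformly.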
This section describes how data with different characteristics can be ingested into the Hadoop system so that processing can take place.
It also describes polystructured data and how Hadoop uses different ingestion mechanisms so that ETL processes can take the data and convert it into information that can be loaded into Hadoop for further processing.
The most important distinction between types of polystructured data is the way the data needs to enter the Hadoop system. Data can be uploaded by clients in bulk, ingested in real time over the network, ingested from an existing relational database model, or arrive through a combination of these.
Data ingestion can also vary based on the importance of the data. Some data requires stronger reliability and failover guarantees, while other data may be constrained by its rate of ingestion and time to live.
In addition, there are many industry-standard storage systems that act as de facto standards for storing huge amounts of data; these systems require special handling when their data is ingested into Hadoop.
This document covers the various ingestion mechanisms and the rationale for choosing each one for a given process.
Why is polystructured data important for Hadoop
This section provides insight into why polystructured data is important for Hadoop and why data should be treated equally irrespective of its origin.
Consider a situation where a bank wants to set up an ATM. This requires considerable analysis: which location is most profitable, where communications are easy, and where most people tend to visit.
Apart from these, another important input for identifying a potential ATM location is customer complaints about the lack of ATMs in an area. This information is very valuable yet not readily available in a structured environment like a database, but it can be acquired by looking elsewhere: customer care chats, voice calls to the bank's front office, or even social media.
This information is neither organized nor structured, but it is valuable and can influence a major decision such as where to place an ATM.
Consider another scenario where a company has an email exchange server that blocks unsolicited email and spam. Normally this type of server needs to read each email and scan any attachments before labelling it as safe. If the company is large, or the volume of email flowing through the exchange server is high, separating legitimate email from spam cannot be done using traditional systems.
The examples above illustrate the underlying fact that many important business decisions begin with a need to process data that is polystructured: data that is related but differs in structure and form. And because polystructured data is so varied, different techniques are required to process and store it.
While this section has given a brief overview of the importance of polystructured data, the next section addresses the different processes and ways of handling these diverse forms of data.
How Hadoop Handles Polystructured data
As the data to be ingested is classified into various types, the process and method of ingesting this inbound traffic also changes in the Hadoop ecosystem. Sometimes it is important to consider the volume of the data when time is a constraint, and sometimes it is important to prioritize the velocity of the data being ingested when network and hardware capacity are the constraints.
Hadoop classifies the data that needs to be ingested into the following types and provides a different set of tools for handling each:
Bulk data is a primary source of data that needs to be analyzed and processed in Hadoop systems. Bulk data normally refers to huge sets of log files, generated by applications or systems, that need to be processed for analytics.
Bulk data normally runs from terabytes to petabytes in size, and this huge data set needs to be ingested into Hadoop within a given time frame. In such a scenario, traditional systems can take days to weeks to ingest the data, which defeats the purpose of analyzing it for business insights, because by the time those insights are drawn the data may have changed considerably.
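For context, the simplest possible form of bulk ingestion is copying files into HDFS through the Hadoop FileSystem API, as in the sketch below; the paths and namenode address are assumptions made for illustration. The dedicated tools described next exist precisely because this manual approach does not scale well to terabyte and petabyte volumes.

    // Minimal bulk-load sketch using the HDFS FileSystem API.
    // Paths and the namenode address are illustrative assumptions.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BulkLoadLogs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // assumed cluster address
            try (FileSystem fs = FileSystem.get(conf)) {
                // Copy a local directory of application logs into HDFS.
                fs.copyFromLocalFile(new Path("/var/log/myapp"),       // assumed local source
                                     new Path("/data/raw/myapp-logs")); // assumed HDFS target
            }
        }
    }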
Hadoop provides the following tools to ingest these huge data sets.
Flume-ng:
Flume-ng (next generation) is a tool that ingests data from a variety of sources (such as folders, FTP locations and network drives) into HDFS on Hadoop.
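As an illustration of how an application can hand events to Flume, the sketch below uses the Flume client SDK to send a log line to a running agent. It assumes a Flume agent with an Avro source already listening on localhost port 41414; the host, port and log line are example values only.

    // Minimal Flume client SDK sketch: send a log line to a running agent.
    // The agent, host and port are assumptions; an Avro source must be
    // configured on the agent side for this to work.
    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class SendToFlume {
        public static void main(String[] args) throws Exception {
            RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
            try {
                Event event = EventBuilder.withBody(
                        "2016-01-05 10:12:00 INFO user C1001 logged in",
                        StandardCharsets.UTF_8);
                client.append(event); // delivered to the agent's Avro source
            } finally {
                client.close();
            }
        }
    }

On the agent side, Flume then routes such events through its configured channel to an HDFS sink, which is how the data ultimately lands in Hadoop.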
The following diagram describes the Flume architecture.