This diagram depicts the Hadoop ecosystem holistically. In any Hadoop system, one of the major areas of concern is identifying the sources of data and determining how to ingest that data into the Hadoop system. Data of varied schemas, relationships, types and formats needs to be stored in the Hadoop system in order to derive meaningful insights. This inbound transfer of data from external systems into the Hadoop core forms a major part of designing a robust Hadoop architecture.
This document focuses on the different ingestion approaches, data types and processes used to bring external data into the Hadoop ecosystem.
The diagram also shows the Hadoop cluster and its interaction with the data storage system (DSS) layer and the various visualization and analytics tools used for reporting; these form an integral part of the data architecture but are considered out of scope for this document.
How polystructured data requires special handling compared to traditional data
Traditional systems handle data that has a schema and relationships attached to it very easily. Data storage systems such as an RDBMS handle the relationships between similar data adhering to a specific schema very efficiently.
Traditional systems also provide analytics and insights on this data when the volume is manageable. These systems are ideal for the transactional data mostly found in OLTP (Online Transaction Processing) systems.
However, as discussed in this document, it is not sufficient to handle only the data that is driven by the business. Most of the information generated today does not come only from core systems; it is also generated by the applications and tools that are part of the wider business process.
In fact, most of the information and business insight is locked in data generated by logs, call center chats, emails, SMS, voice communication and many other channels that form part of customers' interaction with interfaces other than the transactional systems.
The data generated by these diverse systems and services differs in nature and, most importantly, in type and format. It is nonetheless important to look at this data from a single point of view, because together it tells the whole story.
Because of the diverse forms of this data, and the need to analyze it together, traditional data storage systems cannot be used to store this schema-less, diverse data in one place and perform complex analytics on it.
Data that varies in schema and structure, but is bound together by a common use case and must be processed together to provide a complete picture, is called polystructured data.
Polystructured data can be as simple as chat transcripts which, when analyzed, reveal the sentiment of the people involved; it can be a company's email communication; it can be as complex as images and social media status information; or it can be a combination of all of these.
Each of these polystructured data sources needs to be treated separately, but ultimately needs to be saved in a common platform for processing.
Hadoop provides a platform where data generated in various ways and by various processes is saved together in such a way that processing and analytics remain easily manageable.
Each data source is different and has to be treated differently; most importantly, the data needs to be transformed and loaded into Hadoop in such a way that data coming from different sources can all be managed in Hadoop for processing. This process is called ETL (Extract, Transform and Load). The ETL process provides a way to convert the various ingested data into a common format that the Hadoop system can process.
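As a minimal illustration of this normalization step, the hypothetical Java sketch below converts records from two assumed source formats (a call-center chat log line and a transaction CSV line, both invented purely for illustration) into one common tab-separated layout before the data is loaded into Hadoop. The field names and formats are assumptions, not part of any specific tool.

    // Minimal ETL-style normalization sketch (illustrative only).
    // Two hypothetical source record formats are converted into one
    // common tab-separated layout before being loaded into Hadoop.
    import java.util.Arrays;
    import java.util.List;

    public class NormalizeRecords {

        // Assumed common format: source \t timestamp \t customerId \t payload
        static String fromCallCenterChat(String chatLine) {
            // Assumed chat log format: "timestamp|customerId|message"
            String[] parts = chatLine.split("\\|", 3);
            return String.join("\t", "chat", parts[0], parts[1], parts[2]);
        }

        static String fromTransactionCsv(String csvLine) {
            // Assumed transaction format: "timestamp,customerId,amount,branch"
            String[] parts = csvLine.split(",", 4);
            return String.join("\t", "txn", parts[0], parts[1],
                    parts[2] + ";" + parts[3]);
        }

        public static void main(String[] args) {
            List<String> normalized = Arrays.asList(
                    fromCallCenterChat("2016-01-05T10:12:00|C1001|No ATM near my office"),
                    fromTransactionCsv("2016-01-05T10:15:32,C1001,250.00,Downtown"));
            // In a real pipeline these lines would be written to HDFS,
            // for example through the FileSystem API or a Flume sink.
            normalized.forEach(System.out::println);
        }
    }

Once every source is reduced to such a common layout, downstream Hadoop jobs can process records from all channels uniformly.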
This section describes how data with different characteristics can be ingested into the Hadoop system so that processing can take place.
It also describes polystructured data and how Hadoop uses different ingestion mechanisms so that ETL processes can take the data and convert it into information that can be loaded into Hadoop for further processing.
The most important distinction between types of polystructured data is the way the data needs to enter the Hadoop system. Data can be uploaded by clients in bulk, ingested in real time over the network, ingested from an existing relational database model, or arrive through a combination of these.
Data ingestion can also vary based on the importance of the data. Some data requires stronger reliability and failover guarantees, while other data may be constrained by its rate of ingestion and time to live.
In addition, there are many industry-standard storage systems that act as de facto standards for storing huge amounts of data; these systems require special handling when their data is ingested into Hadoop.
This document covers the various ingestion mechanisms and the rationale for choosing each one for a given process.
Why is polystructured data important for Hadoop
This section provides insight into why polystructured data is important for Hadoop and why data should be treated equally irrespective of its origin.
Consider a situation where a bank wants to set up an ATM. This requires considerable analysis: which location is most profitable, where communications are easy, and where most people tend to visit.
Apart from these, another important input for identifying a potential ATM location is customer complaints about the lack of ATMs in an area. This information is very valuable yet not readily available in a structured environment like a database, but it can be acquired by looking elsewhere: customer care chats, voice calls to the bank's front office, or even social media.
This information is neither organized nor structured, but it is valuable and can influence a major decision such as where to place an ATM.
Consider another scenario where a company has an email exchange server that blocks unsolicited email and spam. Normally this type of server needs to read each email and scan any attachments before labelling it as safe. If the company is large, or the volume of email flowing through the exchange server is high, separating legitimate email from spam cannot be done using traditional systems.
The examples above illustrate the underlying fact that many important business decisions begin with a need to process data that is polystructured: data that is related but differs in structure and form. And because polystructured data is so varied, different techniques are required to process and store it.
While this section has given a brief overview of the importance of polystructured data, the next section addresses the different processes and ways of handling these diverse forms of data.
How Hadoop Handles Polystructured data
As the data to be ingested is classified into various types, the process and method of ingesting this inbound traffic also changes in the Hadoop ecosystem. Sometimes it is important to consider the volume of the data when time is a constraint, and sometimes it is important to prioritize the velocity of the data being ingested when network and hardware capacity are the constraints.
Hadoop classifies the data that needs to be ingested into the following types and provides a different set of tools for handling each:
Bulk data is a primary source of data that needs to be analyzed and processed in Hadoop systems. Bulk data normally refers to huge sets of log files, generated by applications or systems, that need to be processed for analytics.
Bulk data normally runs from terabytes to petabytes in size, and this huge data set needs to be ingested into Hadoop within a given time frame. In such a scenario, traditional systems can take days to weeks to ingest the data, which defeats the purpose of analyzing it for business insights, because by the time those insights are drawn the data may have changed considerably.
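For context, the simplest possible form of bulk ingestion is copying files into HDFS through the Hadoop FileSystem API, as in the sketch below; the paths and namenode address are assumptions made for illustration. The dedicated tools described next exist precisely because this manual approach does not scale well to terabyte and petabyte volumes.

    // Minimal bulk-load sketch using the HDFS FileSystem API.
    // Paths and the namenode address are illustrative assumptions.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BulkLoadLogs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // assumed cluster address
            try (FileSystem fs = FileSystem.get(conf)) {
                // Copy a local directory of application logs into HDFS.
                fs.copyFromLocalFile(new Path("/var/log/myapp"),       // assumed local source
                                     new Path("/data/raw/myapp-logs")); // assumed HDFS target
            }
        }
    }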
Hadoop provides the following tools to ingest these huge data sets.
Flume-ng:
Flume-ng (next generation) is a tool that ingests data from a variety of sources (such as folders, FTP locations and network drives) into HDFS on Hadoop.
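As an illustration of how an application can hand events to Flume, the sketch below uses the Flume client SDK to send a log line to a running agent. It assumes a Flume agent with an Avro source already listening on localhost port 41414; the host, port and log line are example values only.

    // Minimal Flume client SDK sketch: send a log line to a running agent.
    // The agent, host and port are assumptions; an Avro source must be
    // configured on the agent side for this to work.
    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class SendToFlume {
        public static void main(String[] args) throws Exception {
            RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
            try {
                Event event = EventBuilder.withBody(
                        "2016-01-05 10:12:00 INFO user C1001 logged in",
                        StandardCharsets.UTF_8);
                client.append(event); // delivered to the agent's Avro source
            } finally {
                client.close();
            }
        }
    }

On the agent side, Flume then routes such events through its configured channel to an HDFS sink, which is how the data ultimately lands in Hadoop.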
The following diagram describes the Flume architecture.