Intention
The intention of this benchmarking is to find a Hadoop big data business case, identify the services for the business solution, implement the solution, and benchmark it as a whole.
This document describes how to choose a set of services that together provide an end-to-end Hadoop big data solution, and how to select a single component from the setup (one at a time) and benchmark that piece of the service.
Throughout this benchmarking we work on a single business case, focusing on the holistic solution and on how to achieve better results based on the statistics.
Business Case
The business case for our benchmarking is a case study that combines the complexities of processing, analytics, data ingestion, and rule-based computation.
The business case we demonstrate here is fraud detection on transactional data for credit card usage. The use case runs in bulk mode, where we plan to accomplish the following:
1.Data ingestion from S3 (bulk data transfer benchmarking)
2.ETL to cleanse the bulk data (HDFS and Pig benchmarking)
3.Analytics model to benchmark the business case with the following business rules:
1.Amount exceeds limit: demonstrates HDFS search and filtering.
2.Expired card: demonstrates the HDFS querying process.
3.Maximum transactions per hour: demonstrates complex memory-bound processing.
4.Country locator: demonstrates complex CPU-bound processing.
5.Card variance: demonstrates complex CPU-bound processing combined with memory-bound processing.
4.Loading data into a warehouse for business analytics and dashboards.
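To make the rules above concrete, here is a hypothetical sketch of rule 1 ("amount exceeds limit") in plain Python. The field names and the limit value are illustrative assumptions, not taken from the actual benchmark code:

```python
# Hypothetical sketch of rule 1 ("amount exceeds limit").
# Field names and the limit are illustrative assumptions, not
# taken from the actual benchmark code.
LIMIT = 10_000  # assumed per-transaction limit

def flag_over_limit(transactions):
    """Return the transactions whose amount exceeds the limit."""
    return [t for t in transactions if t["amount"] > LIMIT]

sample = [
    {"card": "card-0001", "amount": 250},
    {"card": "card-0002", "amount": 15_000},
]
flagged = flag_over_limit(sample)  # only card-0002 is flagged
```

In the benchmark this filter runs over HDFS data at scale; the sketch only shows the shape of the rule.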
Choice of tools – Initial benchmarking (Iteration1)
For this business process we started off with the Cloudera CDH 4.8 distribution and the following choice of tools:
1.Flume/distcp/s3distcp to ingest data into HDFS from S3. Data for processing is loaded into S3 and transferred to HDFS using Flume.
2.Pig to process the raw data and cleanse it for analytics. With CDH 4.8, Pig runs on MRv1.
3.A custom MR job written with the Spring for Apache Hadoop framework to implement the 5 business rules. Data is read from the cleansed folder and the processed data is written to an output folder. Here also the MR version is MRv1.
4.The processed data is loaded into Hive for visualization and realization. The results of the separate business models are loaded into a single table with a specific schema for Tableau to query.
** This benchmarking started on ADD, when the highest CDH version was 4.8 with MRv1 as the processing engine. The end-to-end business process was to create an ingestion channel using Flume, perform ETL using Pig (since no custom tool could process data at that scale), run the business models using custom MapReduce code, and finally use Hive as the data warehouse.
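The custom MR job's structure can be sketched in process as a map and reduce phase over transactions, here for rule 3 ("maximum transactions per hour"). The real job was a Java Spring Hadoop MR job; the field names and the threshold below are assumptions:

```python
# In-process sketch of the MapReduce shape of rule 3 ("maximum
# transactions per hour"). The real job was a Java Spring Hadoop MR
# job; field names and the threshold here are assumptions.
from collections import defaultdict

def map_phase(txn):
    # key each transaction by (card, hour bucket); emit a count of 1
    yield (txn["card"], txn["ts"] // 3600), 1

def reduce_phase(key, counts):
    return key, sum(counts)

def flag_busy_hours(txns, max_per_hour=3):
    shuffled = defaultdict(list)          # simulate shuffle/sort
    for t in txns:
        for key, value in map_phase(t):
            shuffled[key].append(value)
    totals = dict(reduce_phase(k, v) for k, v in shuffled.items())
    # flag (card, hour) buckets that exceed the assumed threshold
    return {k for k, n in totals.items() if n > max_per_hour}
```

On the cluster, the shuffle and reduce run across nodes under MRv1 rather than in a local dictionary.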
Initial benchmarking (Iteration1)
We started off with 2 TB of credit card data on a 16-node Hadoop cluster with m1.xlarge instances.
Following are the statistics on data ingestion into HDFS:
Process Type | Process Description | Time
Approach 1 | S3distcp from S3 to EC2 HDFS (EC2 CDH4 cluster on a private subnet of a VPC) | 8 hrs 36 min
Approach 2 | S3distcp from S3 to EMR, then from EMR to EC2 HDFS (EMR cluster on a public subnet, EC2 cluster on a private subnet) | 5 hrs 19 min
Approach 3 | Flume using EBS volumes (data loaded directly into an EBS volume and later loaded into HDFS using Flume) | 5 hrs 18 min
Statistics on Pig as ETL tool
Process Type | Process Description | Time
ETL | Using Pig to perform ETL on raw data (Pig running on MRv1) | 4 hrs 28 min
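The cleansing step can be illustrated in plain Python, mirroring the FILTER/FOREACH operations a Pig ETL script would apply. The column names below are assumptions, not the benchmark's actual schema:

```python
# Sketch of the cleansing step, mirroring the FILTER/FOREACH
# operations a Pig ETL script would apply; column names are
# assumptions, not the benchmark's actual schema.
import csv
import io

RAW = """card,amount,ts
card-0001,250,1700000000
broken-row-with-missing-fields
card-0002,not-a-number,1700000100
card-0003,99,1700000200
"""

def cleanse(raw_text):
    """Keep only rows with a parseable amount and timestamp."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            cleaned.append({"card": row["card"],
                            "amount": float(row["amount"]),
                            "ts": int(row["ts"])})
        except (TypeError, ValueError):
            continue  # drop malformed records, as the ETL step does
    return cleaned
```

Pig performs the same drop-malformed-records logic in parallel across the cluster, which is why it handles the 2 TB input where a single-process tool would not.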
Statistics on MR process
Process Type | Process Description | Time
Business Process | Custom Hadoop-Spring MR job executing the 5 rules (MRv1) | 6 hrs 27 min
** These results are the best iteration results after rounds of cluster-level tuning.
Initial benchmarking (Iteration1 learning)
With the round-one testing we found the following:
1.The ingestion process works well either with an EMR cluster between S3 and the EC2 cluster, or with data loaded via Flume from an EBS-backed volume.
2.For ETL, Pig performs well and can be a suitable ETL tool of choice.
3.The MR job looks a bit slow. After rounds of cluster-level performance tuning we were able to get the best results for the MR job(s).
4.The output was loaded into Hive tables used by Tableau to create dashboards; the Tableau extract creation over HiveServer took considerable time.
Choice of tools – Second phase benchmarking (Iteration2)
In this round we upgraded the cluster to the CDH 5.0.1 distribution with the following choice of tools:
1.The ingestion tool does not change in this round, as it has no dependency on the Hadoop services; we still use Flume/distcp/s3distcp to ingest data into HDFS from S3. No ingestion statistics were collected for this round.
2.We still use Pig to process the raw data and cleanse it for analytics. In this round we do not change the Pig process but change the underlying engine from MRv1 to YARN.
3.The custom MR job written with the Spring for Apache Hadoop framework implementing the 5 business rules remains the same. Here also the underlying engine changes to YARN.
4.The processed data is loaded into Hive for visualization and realization. Hive now uses HiveServer2 along with YARN as the processing engine.
** In this benchmarking we used the then-highest CDH version, 5.0.1, with YARN as the processing engine.
Second phase benchmarking (Iteration2)
In this round we used 2 TB of credit card data on a 16-node Hadoop cluster with m1.xlarge instances.
Statistics on Pig as ETL tool
Process Type | Process Description | Time
ETL | Using Pig to perform ETL on raw data (Pig running on YARN) | 3 hrs 15 min
Statistics on MR process
Process Type | Process Description | Time
Business Process | Custom Hadoop-Spring MR job executing the 5 rules (YARN) | 4 hrs 28 min
** These results are the best iteration results after rounds of cluster-level tuning for YARN.
** Flume and the ingestion process remain the same as before and hence were not recorded.
Second phase benchmarking (Iteration2 learning)
With the round-two testing we found the following:
1.For ETL, Pig performs well. After introducing YARN as the processing engine, the performance improvement was significant.
2.The MR job's business logic executes better and faster after moving to YARN mode.
3.The output was loaded into Hive tables used by Tableau to create dashboards; Tableau extract creation over HiveServer2 remains considerably slow.
Choice of tools – Third phase benchmarking (Iteration3)
In this round we upgraded the cluster to the CDH 5.2.1 distribution with the following choice of tools:
1.The ingestion tool again does not change, as it has no dependency on the Hadoop services; we still use Flume/distcp/s3distcp to ingest data into HDFS from S3. No ingestion statistics were collected for this round.
2.We still use Pig to process the raw data and cleanse it for analytics. In this round we do not change the Pig process and keep YARN as the underlying engine.
3.The custom MR job is now changed to work with Spark, the in-memory processing engine introduced with CDH 5.2.1. Here we recoded the 5 business rules to use the Spark engine for business model processing.
4.For query processing we introduced Impala, the in-memory query engine that can work as a data-processing tool on Hive tables.
** In this benchmarking we used the highest CDH version, 5.2.1, with YARN as the processing engine.
In this round we used 2 TB of credit card data on a 7-node (high-performance) Hadoop cluster with c3.8xlarge instances.
Statistics on Pig as ETL tool
Process Type | Process Description | Time
ETL | Using Pig to perform ETL on raw data (Pig running on YARN) | 1 hr 45 min
Statistics on Spark process
Process Type | Process Description | Time
Business Process | MR business rules converted to Spark and executed as a Spark job on the cluster | 2 hrs 17 min
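As an illustration of the recoded logic, here is a pure-Python sketch of rule 5 ("card variance"): the per-card variance of transaction amounts. A real PySpark job would express the same grouping and aggregation over an RDD or DataFrame; the field names here are assumptions:

```python
# Pure-Python sketch of rule 5 ("card variance"): the per-card
# variance of transaction amounts. A real Spark job would express
# this over an RDD/DataFrame; field names are assumptions.
from collections import defaultdict

def card_variance(txns):
    by_card = defaultdict(list)
    for t in txns:
        by_card[t["card"]].append(t["amount"])
    variances = {}
    for card, amounts in by_card.items():
        mean = sum(amounts) / len(amounts)
        # population variance of this card's amounts
        variances[card] = sum((a - mean) ** 2 for a in amounts) / len(amounts)
    return variances
```

Because Spark keeps the grouped amounts in memory across the cluster, this mixed CPU-and-memory computation is where the move from MRv1 paid off most.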
Statistics on Impala Queries
Process Type | Process Description | Time
Business Process | MR business rules converted to Impala queries and executed on Impala | 2 hrs 37 min
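The query shape for rule 2 ("expired card") can be illustrated as below, run here against an in-memory SQLite table purely for demonstration; the benchmark ran such queries on Impala over Hive tables, and the column names are assumptions:

```python
# Illustrative query shape for rule 2 ("expired card"), run against
# an in-memory SQLite table for demonstration only; the benchmark
# ran such queries on Impala over Hive tables, and the column names
# are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (card TEXT, expiry TEXT, txn_date TEXT)")
conn.executemany("INSERT INTO txns VALUES (?, ?, ?)", [
    ("card-0001", "2014-06-30", "2014-08-01"),  # used after expiry
    ("card-0002", "2015-12-31", "2014-08-01"),  # still valid
])
# ISO date strings compare correctly as text
expired = conn.execute(
    "SELECT card FROM txns WHERE txn_date > expiry"
).fetchall()
```

Expressing the rule as a single declarative query is what makes Impala attractive when the data is already structured and schema-based.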
** These results are the best iteration results after rounds of cluster-level tuning for Spark, YARN and Impala.
** Flume and the ingestion process remain the same as before and hence were not recorded.
Third phase benchmarking (Iteration3 learning)
With the round-three testing we found the following:
1.In this round we got the best performance numbers for the YARN Pig process; this round was the best w.r.t. performance.
2.After introducing Spark we got the best performance values for the business models; all 5 rules now execute in considerably less time.
3.We tried to use Impala in two ways. First, we used Impala only as a query processing engine with Tableau; in this case Tableau got better performance than with Hive.
Third phase benchmarking (Iteration3 learning – Impala)
With Impala performance testing we had the following considerations:
1.Use Impala itself as the processing engine (instead of Spark or MapReduce); this helps when the data is structured and schema-based (like the data in this case).
2.Have the Tableau engine execute the Impala queries directly, running the business cases (5 rules) from Tableau connected to Hadoop.
With case 1, we have the following statistics:
Process Type | Process Description | Time
Business Process | Execute 5 Impala queries, one per rule, serially | 2 hrs 37 min
Load Testing | Execute the same Impala queries in parallel | 3 hrs 58 min
In this case, when we used impala-shell to run multiple queries at the same time, the processing time increased as expected.
With case 2, Tableau was given the responsibility to generate dashboards by directly firing custom Impala queries. We had the following observations:
1.Tableau cannot create an extract directly from custom queries; Tableau needs to fire the complex queries first and only afterwards create an extract.
2.Tableau cannot use custom queries in tandem with extracts; hence for any visualization the idea is to use Spark, MR, or Impala as the backend and only generate the extract from Tableau instead of firing custom queries from Tableau.
Overall Learning
We can take away the following learnings from the performance tests:
1.Flume and S3 ingestion via EMR can be effective ways to ingest huge data volumes into HDFS.
2.Pig can be an ideal ETL tool over large data in HDFS; Pig works better in YARN mode with optimal tuning.
3.Business rules can be executed as MR jobs or Spark jobs; in the case of memory-based computations, Spark can be the better option.
4.Impala can be preferred over Hive for custom queries. Tableau connects better with Impala for dashboards.
5.Impala also works well with multiple users working at the same time; multiple Tableau instances can execute queries on Impala concurrently.
6.Tableau should not be used to execute custom queries to create dashboards. Tableau should connect to Impala tables for extracts or live connections only.