Each record represents a crime event published by the Chicago Crime division: the first column is the case number, the second the time, the third the address, the fourth the crime type, and so on.
The main objective of this analysis is to categorize crimes by type for each zip code, giving an overview of the type and number of crimes per zip. This information can be used to identify areas of Chicago where the crime rate is low, which in turn helps in making rational decisions about where to move and where to buy property in the city.
Flume Ingestion of Data into HDFS
This step shows how data is loaded into HDFS by the Flume service. The following configuration is executed by the Flume agent:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data02/rawdata
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://12.cs1cloud.internal:8020/cleanseddata
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 2000
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.batchSize = 500000
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
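With the agent defined, ingestion can be started from the command line. A minimal sketch, assuming the configuration above is saved as crime-agent.conf under the Flume conf directory (both names are hypothetical; the agent name a1 comes from the configuration above):

flume-ng agent \
  --conf /etc/flume/conf \
  --conf-file /etc/flume/conf/crime-agent.conf \
  --name a1 \
  -Dflume.root.logger=INFO,console

The agent then watches the spool directory /data02/rawdata and rolls events into files under /cleanseddata on HDFS, as configured above.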
The Spark analytics job is implemented in Java using the Spark Java API (which is backed by Scala). The business logic is as follows:
package com.idh.driver;
import java.util.regex.Pattern;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;
public class CrimeAnalyticsDriver {
private static Pattern COMMA = Pattern.compile(",");
public static void main(String[] args) {
try
{
System.out.println("Initiating Crime Analytics Job.....");
// Configure Spark and read the cleansed crime records from HDFS (the Flume sink directory)
SparkConf conf = new SparkConf().setAppName("CrimeAnalytics");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> input = context.textFile("/cleanseddata");
// Key each record by "zip,crimeType" with an initial count of 1
JavaPairRDD<String, Integer> pairInputData = input.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String arg0)
throws Exception {
String splits[] = COMMA.split(arg0);
String zip = splits[10];
String crime = splits[5];
return new Tuple2<String, Integer>(zip + "," + crime, 1);
}
});
// Sum the counts for each (zip,crimeType) key
JavaPairRDD<String, Integer> reduceData = pairInputData.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer arg0, Integer arg1) throws Exception {
return arg0+arg1;
}
});
JavaRDD<String> resultData = reduceData.map(new Function<Tuple2<String,Integer>, String>() {
@Override
public String call(Tuple2<String, Integer> arg0)
throws Exception {
return arg0._1 + "," + arg0._2;
}
});
// Write the per-zip crime counts to HDFS for the Hive load step, then stop the context
resultData.saveAsTextFile("/crimeAnalytics_CrimeCount_By_Zip");
context.stop();
}
catch(Exception ex)
{
ex.printStackTrace();
}
}
}
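Once packaged as a jar, the driver can be submitted to the cluster with spark-submit. A minimal sketch, assuming the class above is built into crime-analytics.jar (a hypothetical artifact name) and the cluster runs Spark on YARN:

spark-submit \
  --class com.idh.driver.CrimeAnalyticsDriver \
  --master yarn \
  crime-analytics.jar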
Loading data Into Hive Warehouse
The following script loads data into the Hive Warehouse
sudo -s <<EOF
hadoop fs -rm -r -skipTrash /crimeAnalytics_CrimeCount_By_Zip/_SUCCESS
hive -e "drop table if exists crimecountzip"
hive -e "create external table crimecountzip (zip String, type String, count String) row format delimited fields terminated by ','"
hive -e "load data inpath '/crimeAnalytics_CrimeCount_By_Zip/' into table crimecountzip"
EOF
cd /scripts
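With the table loaded, the per-zip crime breakdown can be queried straight from Hive. A minimal sketch that lists every crime type and its count for one zip code (60601 is just an illustrative value):

hive -e "select * from crimecountzip where zip = '60601'"

Queries like this feed the visualization charts described below.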
The following charts are provided for data visualization: