Hadoop provides a cost-effective storage solution for businesses.
It enables businesses to easily access new data sources and tap into different types of data to derive value from that data.
It is a highly scalable storage platform.
Hadoop’s unique storage method is based on a distributed file system that essentially ‘maps’ data wherever it is located on the cluster. Because the data-processing tools usually run on the same servers where the data resides, processing is much faster.
Hadoop is now widely used across industries, including finance, media and entertainment, government, healthcare, information services, and retail.
Hadoop is fault tolerant. When data is sent to an individual node, it is also replicated to other nodes in the cluster, so if that node fails another copy is available for use.
Hadoop is more than just a faster, cheaper database and analytics tool. It is designed as a scale-out architecture that can affordably store all of a company’s data for later use.
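As a quick illustration of the replication behaviour described above, here is a minimal Java sketch using the standard Hadoop client API. It assumes a reachable cluster whose configuration is on the classpath; the file path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Cluster-wide default replication factor (dfs.replication, normally 3).
        Configuration conf = new Configuration();
        System.out.println("Default replication: " + conf.get("dfs.replication", "3"));

        // Per-file replication can be raised or lowered after the fact;
        // the NameNode then schedules extra copies on (or removes copies from) other DataNodes.
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");   // hypothetical path
        fs.setReplication(file, (short) 3);
        System.out.println("Replication of " + file + ": "
                + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}
```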
What is Big Data?
Why are all industries talking about Big Data?
What are the issues in Big Data?
What are the challenges of storing Big Data?
What are the challenges of processing Big Data?
What technologies support Big Data?
What is Hadoop?
History of Hadoop
Why Hadoop?
Hadoop Use cases
Advantages and Disadvantages of Hadoop
Importance of the Different Hadoop Ecosystem Components
Importance of Integration with Other Big Data Solutions
Big Data Real-Time Use Cases
2. HDFS (Hadoop Distributed File System)
Daemons in Hadoop
NameNode
Secondary NameNode
DataNode
Data Storage in HDFS
HDFS Block size
HDFS Replication factor
Accessing HDFS (see the Java sketch at the end of this module)
HDFS Commands
Configurations
How to overcome the Drawbacks in HDFS
How to add new nodes (Commissioning)
How to remove existing nodes (Decommissioning)
How to verify the Dead Nodes
How to start the Dead Nodes
Introduction to NameNode Federation
Introduction to NameNode High Availability
Difference between Hadoop 1.x and Hadoop 2.x versions
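For the HDFS access topics above, a minimal Java sketch of the FileSystem API, assuming a running pseudo- or fully-distributed cluster whose configuration is on the classpath; the /user/demo paths are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a small file (roughly what `hadoop fs -put` does).
        Path file = new Path("/user/demo/hello.txt");            // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back (roughly `hadoop fs -cat`).
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        // Inspect block size and replication factor for the file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // List a directory (roughly `hadoop fs -ls`).
        for (FileStatus entry : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(entry.getPath());
        }
        fs.close();
    }
}
```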
3. MapReduce
Importance of JobTracker
What are the roles of the JobTracker?
What are the drawbacks of the JobTracker?
Importance of TaskTracker
What are the roles of the TaskTracker?
What are the drawbacks of the TaskTracker?
What are the Data types in MapReduce?
Why are these important in MapReduce?
Can we write custom Data Types in MapReduce?
Text Input Format
Key Value Text Input Format
Sequence File Input Format
NLine Input Format
Importance of Input Format in Map Reduce
How to use Input Format in Map Reduce
How to write custom Input Formats and their Record Readers
Text Output Format
Sequence File Output Format
Importance of Output Format in Map Reduce
How to use Output Format in Map Reduce
How to write custom Output Formats and their Record Writers
What is a mapper in a MapReduce job?
Why do we need a mapper?
What are the Advantages and Disadvantages of a mapper?
Writing mapper programs
What is a reducer in a MapReduce job?
Why do we need a reducer?
What are the Advantages and Disadvantages of a reducer?
Writing reducer programs
What is a Driver in a MapReduce job?
Why do we need a Driver?
Writing Driver program
InputSplit
Need for InputSplit in MapReduce
InputSplit Size
InputSplit Size Vs Block Size
InputSplit Vs Mappers
Map Reduce Job execution flow
What is a combiner in a MapReduce job?
Why do we need a combiner?
What are the Advantages and Disadvantages of a Combiner?
Writing Combiner programs
Identity Mapper and Identity Reducer
What is a Partitioner in a MapReduce job?
Why do we need a Partitioner?
What are the Advantages and Disadvantages of a Partitioner?
Writing Partitioner programs
What is the Distributed Cache in a MapReduce job?
Importance of the Distributed Cache in a MapReduce job
What are the Advantages and Disadvantages of the Distributed Cache?
Writing Distributed Cache programs
What is a Counter in a MapReduce job?
Why do we need Counters in a production environment?
How to Write Counters in Map Reduce programs
How to write custom MapReduce keys using WritableComparable
How to write custom MapReduce values using Writable
Map Side Join
What is the importance of a Map Side Join?
Where do we use it?
Reduce Side Join
What is the importance of a Reduce Side Join?
Where do we use it?
What is the difference between a Map Side Join and a Reduce Side Join?
Importance of Compression techniques in production environment
Compression Types
NONE, RECORD and BLOCK
Compression Codecs
Default, Gzip, Bzip2, Snappy, and LZO
Enabling and Disabling these techniques for all the Jobs
Enabling and Disabling these techniques for a particular Job
How to write MapReduce jobs in Java (an end-to-end sketch appears at the end of this module)
Running the Map Reduce jobs in local mode
Running the Map Reduce jobs in pseudo mode
Running the Map Reduce jobs in cluster mode
How to debug MapReduce jobs locally
How to debug MapReduce jobs on a remote cluster
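A single end-to-end sketch tying together several of the topics above: mapper, reducer, driver, combiner, a custom counter, and Gzip output compression. It is the classic word-count pattern written against the Hadoop 2.x org.apache.hadoop.mapreduce API, not any particular production job; class names and paths are placeholders.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountJob {

    // Mapper: (byte offset, line) -> (word, 1)
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    context.getCounter("WordCount", "EMPTY_TOKENS").increment(1); // custom counter
                    continue;
                }
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also reused as the combiner): (word, [1,1,...]) -> (word, sum)
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable sum = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            sum.set(total);
            context.write(key, sum);
        }
    }

    // Driver: wires the pieces together and submits the job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountJob.class);

        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);      // combiner = mini-reducer on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Compress the final output with Gzip (RECORD/BLOCK types apply to SequenceFile output).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```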
4. YARN and Advanced MapReduce
What is YARN?
What is the importance of YARN?
Where can we use the concept of YARN in real-world projects?
What is the difference between YARN and MapReduce?
What is Data Locality?
Does Hadoop follow Data Locality?
What is Speculative Execution?
Does Hadoop perform Speculative Execution?
Importance of each command
How to execute the command
Explanation of MapReduce admin-related commands
Can we change the existing MapReduce configurations?
Importance of these configurations (a per-job override sketch appears at the end of this module)
Writing Unit Tests for Map Reduce Jobs
Use of Secondary Sorting and how to implement it in MapReduce
How to identify performance bottlenecks in MR jobs and how to tune them
Map Reduce Streaming and Pipes with examples
Exploring the Apache MapReduce Web UI
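As referenced above, a minimal sketch of overriding the stock MapReduce configuration per job, here toggling speculative execution; it assumes the standard Hadoop 2.x property names, and the job itself is just a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Speculative execution launches backup attempts for slow ("straggler") tasks.
        // It is on by default; turn it off per job when tasks have side effects.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        // Data locality is handled by the scheduler automatically: it tries to place
        // each map task on a node (or at least a rack) holding the input block.

        Job job = Job.getInstance(conf, "speculative-execution-demo");
        System.out.println("map speculative: "
                + job.getConfiguration().getBoolean("mapreduce.map.speculative", true));
    }
}
```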
5. Apache Pig
Introduction to Apache Pig
Map Reduce Vs Apache Pig
SQL Vs Apache Pig
Different data types in Pig
Local Mode
Map Reduce Mode
Grunt Shell
Script
Embedded
How to write the UDF’s in Pig
How to use the UDF’s in Pig
Importance of UDF’s in Pig
How to write the Filter’s in Pig
How to use the Filter’s in Pig
Importance of Filter’s in Pig
How to write the Load Functions in Pig
How to use the Load Functions in Pig
Importance of Load Functions in Pig
How to use the Store Functions in Pig
Importance of Store Functions in Pig
Transformations in Pig
How to write complex Pig scripts
How to integrate Pig and HBase
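A minimal Java EvalFunc sketch for the Pig UDF topics above; the class name and the relation/field names in the usage comment are hypothetical.

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial Pig UDF that upper-cases its first argument.
public class UpperCaseUdf extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                    // Pig treats null as "no result"
        }
        return input.get(0).toString().toUpperCase();
    }
}

// Usage from a Pig script (jar name, relation, and field are hypothetical):
//   REGISTER my-udfs.jar;
//   names_upper = FOREACH names GENERATE UpperCaseUdf(name);
```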
6. Apache Hive
Introduction to Hive architecture
Driver
Compiler
Semantic Analyzer
Hive Integration with Hadoop
Hive Query Language (HiveQL)
SQL vs HiveQL
Hive Installation and Configuration
Hive, Map-Reduce and Local-Mode
Hive DDL and DML Operations
CLI
HiveServer
HWI (Hive Web Interface)
Embedded metastore configuration
External metastore configuration
How to write the UDF’s in Hive
How to use the UDF’s in Hive
Importance of UDF’s in Hive
How to use the UDAF’s in Hive
Importance of UDAF’s in Hive
How to use the UDTF’s in Hive
Importance of UDTF’s in Hive
How to write complex Hive queries
What is Hive Data Model?
Importance of Hive Partitions in production environment
Limitations of Hive Partitions
How to write Partitions
Importance of Hive Buckets in production environment
How to write Buckets
Importance of Hive SerDe’s in production environment
How to write SerDe programs
How to integrate Hive and HBase
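For the Hive UDF topics above, a minimal sketch using the classic org.apache.hadoop.hive.ql.exec.UDF style (the simpler of Hive's UDF APIs); the class name, jar path, function name, and table are hypothetical.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial Hive UDF that strips leading/trailing whitespace from a string column.
public class TrimUdf extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().trim());
    }
}

// Usage from the Hive CLI:
//   ADD JAR /tmp/my-hive-udfs.jar;
//   CREATE TEMPORARY FUNCTION trim_udf AS 'TrimUdf';
//   SELECT trim_udf(name) FROM customers;
```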
7. Apache ZooKeeper
Introduction to ZooKeeper
Pseudo-mode installation
ZooKeeper cluster installation
Basic command execution
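A minimal Java client sketch for the basic ZooKeeper operations above, assuming a server listening on localhost:2181; the znode path and data are placeholders.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkBasicsDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a local standalone/pseudo-mode server and wait until the session is live.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Equivalent of `create /demo hello` in zkCli.sh
        String path = "/demo";
        if (zk.exists(path, false) == null) {
            zk.create(path, "hello".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.PERSISTENT);
        }

        // Equivalent of `get /demo`
        byte[] data = zk.getData(path, false, null);
        System.out.println("Data at " + path + ": " + new String(data));

        zk.close();
    }
}
```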
8. Apache HBase
HBase introduction
HBase use cases
HBase basics
Column families
Scans
Local mode
Pseudo mode
Cluster mode
Storage
Write Ahead Log
Log Structured Merge Trees
MapReduce over HBase
Key design
Bloom Filters
Versioning
Coprocessors
Filters
REST
Thrift
Hive
Web Based UI
Schema definition
Basic CRUD operations
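A minimal Java sketch of the basic CRUD operations above, using the HBase 1.x+ client API; the table name, column family, and values are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {   // hypothetical table

            byte[] cf = Bytes.toBytes("info");   // hypothetical column family

            // Create / update
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(cf, Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println("name = "
                    + Bytes.toString(result.getValue(cf, Bytes.toBytes("name"))));

            // Delete
            table.delete(new Delete(Bytes.toBytes("row1")));
        }
    }
}
```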
9. Apache Sqoop
Introduction to Sqoop
MySQL client and Server Installation
Sqoop Installation
How to connect to a relational database using Sqoop
Sqoop commands and examples for Import and Export
10. Apache Flume
Introduction to Flume
Flume installation
Flume agent usage and execution of Flume examples
11. Apache Oozie
Introduction to Oozie
Oozie installation
Executing Oozie workflow jobs
Monitoring Oozie workflow jobs
12. MongoDB
Introduction to MongoDB
MongoDB installation
MongoDB examples
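A minimal Java sketch for the MongoDB examples above, using the synchronous MongoDB Java driver and assuming a mongod instance on localhost:27017; the database, collection, and fields are hypothetical.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class MongoDemo {
    public static void main(String[] args) {
        // Connect to a local mongod instance.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("training");            // hypothetical database
            MongoCollection<Document> users = db.getCollection("users");  // hypothetical collection

            // Insert one document.
            users.insertOne(new Document("name", "alice").append("city", "Hyderabad"));

            // Query it back.
            Document found = users.find(new Document("name", "alice")).first();
            System.out.println(found == null ? "not found" : found.toJson());
        }
    }
}
```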