Hadoop

Commodity Hardware - Slaves (Data Nodes)

Carrier-Grade Hardware - Masters (Name Nodes)

Ecosystem - anything outside the Hadoop file system and MapReduce,

i.e. anything besides these:

1. Name Node

2. Secondary Name node

3. Job Tracker

4. Data Nodes

5. Task Trackers

Hadoop is becoming the 'Swiss Army knife of IT'

Open Source Projects

Move data in or out of Hadoop (Sqoop, Flume, Storm)

Make Hadoop easier to use (Pig, Hive, Oozie)

Build new software on top of Hadoop (HBase, Impala)

ECOSYSTEM MATRIX

Where can we use Hadoop?

    • Volume and variety of data

    • Batch Processing

    • Parallelisable workloads

Recommendation engine

Index building

Text mining / pattern recognition

Risk assessment / threat analysis

Predictive analysis

Fault/ error detection

Common Use Cases

Yelp - heavy Hadoop user - 400 GB of log data per day

- search recommendation

- review highlights

- spelling correction

They use Hadoop for nearly everything.

High level Architecture

Storage - HDFS

Processing - MapReduce
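To make the processing side concrete, here is a minimal sketch of a MapReduce job in Java - the classic word count, using the standard org.apache.hadoop.mapreduce API (input/output paths are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: runs on the slaves, one task per input split; emits (word, 1).
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              ctx.write(word, ONE);
            }
          }
        }
      }

      // Reduce: sums the 1s for each word routed to this reducer.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) sum += c.get();
          ctx.write(word, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));    // illustrative
        FileOutputFormat.setOutputPath(job, new Path("/output")); // illustrative
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The whole run is one 'job'; each map or reduce invocation assigned to a slave is a 'task' (see the definitions below).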

Definitions

Job - all the tasks that need to run across all of the data

Task - an individual piece of the analysis

NameNode, DataNode, etc. - the daemons that run on these machines

NN - Name Node, JT - Job Tracker, 2NN - Secondary NN (the master daemons)

DN - Data Node, TT - Task Tracker (one set of daemons per slave)

MASTER DAEMONS

NAME NODE

Handles storage metadata

Keeps metadata in memory and also on disk

Secondary Name Node

Performs 'checkpointing' ONLY for NameNode

Not a warm, hot, or any sort of failover at all

Job Tracker

Coordinates processing of stored data

Also coordinates scheduling of job processing

A typical cluster has one NN, one 2NN, and one JT

SLAVE DAEMONS

Data Node - handles raw data reads and writes

Task Tracker - handles individual task assignments

Slave nodes report 'heartbeats' to the master daemons,

i.e. DNs report to the NN and TTs report to the JT:

- I am alive (DN and TT)

- Here's what I am storing (DN)

- I can take tasks (TT)

- I am working on tasks X, Y (TT)

- I finished tasks X, Y, Z (TT)
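A toy model of what these heartbeat messages carry - these are NOT Hadoop's real RPC types; every name below is hypothetical, for illustration only:

    import java.util.List;

    public class HeartbeatSketch {
      // DN -> NN: "I am alive" + "here's what I am storing"
      record DataNodeHeartbeat(String dnId, List<String> storedBlocks) {}

      // TT -> JT: "I am alive" + slot/task status
      record TaskTrackerHeartbeat(String ttId,
                                  int freeSlots,                 // "I can take tasks"
                                  List<String> runningTasks,     // "I am working on X, Y"
                                  List<String> finishedTasks) {} // "I finished X, Y, Z"

      public static void main(String[] args) {
        var dn = new DataNodeHeartbeat("dn-01", List.of("blk_1", "blk_2"));
        var tt = new TaskTrackerHeartbeat("tt-01", 2,
            List.of("task_X"), List.of("task_Y", "task_Z"));
        System.out.println(dn);
        System.out.println(tt);
      }
    }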

Hadoop can run in one of three modes:

1. Monolithic (local/standalone) mode - everything in the same JVM - all daemons are mocked

2. Pseudo-distributed mode - all 5 daemons in separate JVMs on a single machine (NN, 2NN, JT, DN, TT)

3. Fully distributed mode
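Which mode you get is largely a matter of configuration. A hedged sketch of the client-side setting that separates local from (pseudo-)distributed operation (host and port are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ModeConfig {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Monolithic/local mode: one JVM, local file system, no real daemons.
        conf.set("fs.defaultFS", "file:///");
        // Pseudo- or fully distributed mode points at a NameNode instead, e.g.:
        // conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // illustrative host
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Using file system: " + fs.getUri());
      }
    }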

HDFS

    • Distributed File System

      • Block size is 64 MB by default; it is recommended to change it to 128 MB

      • Blocks are placed according to the configured replication factor (RF)

    • Files are split into blocks and distributed across the slaves

    • Each block is replicated (3x by default) for durability and concurrent access
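A sketch of inspecting blocks and replication from a Java client via the standard FileSystem API (the file path is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/sample.log")); // illustrative
        System.out.println("block size: " + status.getBlockSize()
            + ", replication: " + status.getReplication());
        // One entry per block; each lists the DNs holding a replica of it.
        for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("offset " + b.getOffset()
              + " -> hosts " + String.join(",", b.getHosts()));
        }
      }
    }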

WORM - Write Once, Read Many

No modification of files is possible; if a change is needed, the file must be deleted and reloaded.

NAME NODE - metadata on disk

The NN persists metadata in 2 files:

1. fsimage (point-in-time snapshot)

2. edit logs (changes made since the snapshot)

fsimage and edit logs are periodically merged.
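A toy illustration of that merge - replay the logged changes into the snapshot to get a new snapshot. The real fsimage and edits files are binary; this only mimics the idea:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class CheckpointSketch {
      public static void main(String[] args) {
        // fsimage: point-in-time snapshot of the namespace (path -> metadata).
        Map<String, String> fsimage = new HashMap<>();
        fsimage.put("/data/a.log", "blocks=[b1,b2] rf=3");

        // edit log: changes made since the snapshot was taken.
        List<String[]> editLog = new ArrayList<>();
        editLog.add(new String[] {"ADD", "/data/b.log", "blocks=[b3] rf=3"});
        editLog.add(new String[] {"DELETE", "/data/a.log", ""});

        // Checkpoint: apply every edit to produce a new fsimage,
        // after which the edit log can be truncated.
        for (String[] edit : editLog) {
          if (edit[0].equals("ADD")) fsimage.put(edit[1], edit[2]);
          if (edit[0].equals("DELETE")) fsimage.remove(edit[1]);
        }
        editLog.clear();
        System.out.println("new fsimage: " + fsimage);
      }
    }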

ADMINS - always back up the fsimage and edit logs.

Each DN reports a heartbeat every 3 seconds. If a DN does not report for 10 heartbeats (30 seconds), it is taken out of rotation and its data blocks are re-replicated elsewhere.

Every hour, each DN sends a block report to the NN.

A checksum is generated on every block write; on every read, the checksum is validated.
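A sketch of the write-time/read-time checksum idea using CRC32 (HDFS actually keeps per-chunk CRCs in a metadata file next to each block replica; this is a simplification):

    import java.util.zip.CRC32;

    public class ChecksumSketch {
      static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
      }

      public static void main(String[] args) {
        byte[] block = "some block contents".getBytes();
        long writeTime = checksum(block); // generated on write, stored with the block

        // ... later, on every read: recompute and compare ...
        if (checksum(block) != writeTime) {
          // corrupt replica: the client would read another DN's copy
          // and the bad replica would be re-replicated from a good one
          throw new IllegalStateException("checksum mismatch, block corrupt");
        }
      }
    }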

STEP-BY-STEP WRITE OPERATION

Block write at high level

1. Client cuts the input file into block-sized chunks

2. Client contacts the NN to request a write operation (sends the # of blocks and the RF)

3. NN responds with a pipeline of DNs to write the first block to

4. Client reaches out to the first DN in the pipeline and performs the write (the client only contacts the 1st node in the pipeline)

5. The 2nd, 3rd, and all remaining DNs receive the block directly from the previous DN in the pipeline

6. After the final DN responds, the client is notified of the successful block write

7. Client notifies the NN of the block write

8. NN responds with a pipeline of DNs to write the second block to

9. Rinse, repeat...

10. Last block is written

11. File is closed in the NN

DATA NEVER FLOWS THROUGH NN
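From the client's point of view, the whole pipeline is hidden behind a single output stream. A sketch using FileSystem.create(), which takes a per-file replication factor and block size (path and values are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Steps 1-11 happen behind this call: the client library asks the NN
        // for a pipeline of DNs and streams bytes to the first DN in it.
        try (FSDataOutputStream out = fs.create(
            new Path("/data/out.txt"),  // illustrative path
            true,                       // overwrite
            4096,                       // buffer size
            (short) 3,                  // replication factor (RF)
            128L * 1024 * 1024)) {      // block size: 128 MB
          out.writeUTF("hello HDFS");
        } // close() flushes the last block, then the NN closes the file
      }
    }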

Secondary Name Node

It's for checkpointing only

On a regular basis (every 1 hour by default), it:

- Copies the fsimage and edit logs from the NN

- Merges the edit logs into a new fsimage

- Writes the new fsimage back to the NN

THAT IS ALL IT DOES (no failover). It only does this compacting to keep Hadoop running fast.

The 2NN can be down for a short period of time.

Without it, the NN can still compact, but it will take longer since the edit logs could be much longer; thus the RPO (Recovery Point Objective) will be longer.

NameNode HA Justification

a. Large Enterprises

b. Internal or External SLAs

c. Multi-day read/write jobs

NameNode HA was introduced in Hadoop 2.0 / CDH4 (Cloudera's distribution)

Note that this is not the Secondary NN but a Standby Name Node

How does NN HA work?

The active NN and the Standby NN both keep an up-to-date picture of HDFS: DNs send heartbeats and block reports to both, and the standby replays the edit log (via the Journal Nodes, below).

Journal Nodes:

The fsimage and edit logs could be a single point of failure (SPOF).

NN HA resolves that by using a set of Journal Nodes (JNs):

through the Quorum Journal Manager (QJM), which runs inside the NN, synchronized updates (the edit log entries) are sent to all of the Journal Nodes.

Journal Nodes should always come in odd numbers - to avoid 'ties' and 'split-brain' situations.

Each JN has its own disk, and usually 3 JNs are sufficient.

A split-brain scenario is when both the NN and the standby node think they are active.

--> Each Quorum Journal Manager sends an epoch number to the JNs. When you first start the NN, it registers 1 as the epoch number; from then on, in addition to the edit log entries, it also sends the epoch number.

--> Journal Nodes remember the highest epoch number they have ever seen. When the standby NN is activated (manually or automatically), the epoch number is guaranteed to increment. If at this point the original NN still thinks it is active and the standby also thinks it is active, both write to the Journal Nodes at the same time. Since the Journal Nodes only accept writes carrying the highest epoch number, the updates from the original NN are dropped.

Note: on every NN toggle, the epoch number advances.
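A toy sketch of that fencing rule on a Journal Node - not the real QJM code, all names hypothetical:

    public class JournalNodeSketch {
      private long highestEpochSeen = 0;

      // Called by a QJM trying to append an edit. Only a writer holding at
      // least the highest epoch ever seen is accepted, so an NN that was
      // fenced off (lower epoch) has its updates silently dropped.
      synchronized boolean acceptEdit(long writerEpoch, String edit) {
        if (writerEpoch < highestEpochSeen) {
          return false; // stale writer: reject (split-brain fencing)
        }
        highestEpochSeen = writerEpoch;
        // ... append the edit to this JN's local edit log ...
        return true;
      }

      public static void main(String[] args) {
        JournalNodeSketch jn = new JournalNodeSketch();
        System.out.println(jn.acceptEdit(1, "mkdir /a")); // true: original active NN
        System.out.println(jn.acceptEdit(2, "mkdir /b")); // true: standby promoted, epoch bumped
        System.out.println(jn.acceptEdit(1, "mkdir /c")); // false: old NN is fenced off
      }
    }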

High availability configuration is a bit complex.

It requires 3-6 additional machines.

DON'T do it unless you have to.

The NN is heavily battle-tested and does not break unless there is human error.

A Secondary NN on separate hardware is sufficient in most cases.

NAME NODE FEDERATION

Breaks the NN functionality up across several NNs. For example, with two separate directories, say 'prod' and 'dev', we could set up one NN machine to manage the 'prod' directory and another NN machine to manage the 'dev' directory.

Isolates HDFS directories from each other.

Allows near-infinite scaling of the NN

Can be used in combination with HA

Almost never used in prod - the complexity involved is not worth it.

ACCESS CONTROL

Authentication - "are you who you say you are"

Authorisation - "should you get access to what you're asking"

HDFS does authorisation, but out of the box it does not do authentication - it simply trusts the name of the user who runs the command. Authentication can be enabled via Kerberos.

HDFS PERMISSIONS: Linux-like permissions (UGO) - user, group, and others

'x' on an HDFS directory means permission to browse it

'w' on an HDFS file means permission to delete it
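A sketch of setting those UGO permissions from a Java client (the path is illustrative); the equivalent shell command would be hdfs dfs -chmod 750 /data/reports:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class HdfsPerms {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/data/reports"); // illustrative
        // rwxr-x--- : owner has full access, group can browse ('x' on a
        // directory), others get nothing.
        fs.setPermission(dir, new FsPermission((short) 0750));
        System.out.println(fs.getFileStatus(dir).getPermission());
      }
    }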

Kerberos

- Basically a ticketing system
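A hedged sketch of obtaining that ticket from a Hadoop client using UserGroupInformation with a keytab (principal and keytab path are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLogin {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Switch from "trust the client's username" to Kerberos authentication.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
            "analyst@EXAMPLE.COM",                    // illustrative principal
            "/etc/security/keytabs/analyst.keytab");  // illustrative keytab
        System.out.println("logged in as: " + UserGroupInformation.getLoginUser());
      }
    }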