Hadoop
Commodity Hardware - Slaves (Data Nodes)
Carrier-Grade Hardware - Masters (Name Nodes)
Ecosystem - anything outside the Hadoop file system (HDFS) and MapReduce,
i.e. anything besides these five:
1. Name Node
2. Secondary Name node
3. Job Tracker
4. Data Nodes
5. Task Trackers
Hadoop is becoming the 'Swiss Army knife of IT'
Open Source Projects
Move data in or out of Hadoop (Sqoop, Flume, Storm)
Make Hadoop easier to use (Pig, Hive, Oozie)
Build new software on top of Hadoop (HBase, Impala)
ECOSYSTEM MATRIX
Where can we use Hadoop?
Volume and variety of data
Batch Processing
Parallelise
Recommendation engine
Index building
Text mining / pattern recognition
Risk assessment / threat analysis
Predictive analysis
Fault/ error detection
Common Use Cases
Yelp - heavy Hadoop user - 400 GB of log data per day
- search recommendation
- review highlights
- spelling correction
They use Hadoop for nearly everything.
High level Architecture
Storage - HDFS
Processing - MapReduce
Definitions
Job - all the tasks which need to run over all the data
Task - an individual piece of the analysis
NameNode, DataNode, etc. - the daemons which run on the cluster machines
NN - Name Node, JT - Job Tracker, 2NN - Secondary NN (all master daemons)
DN - Data Node, TT - Task Tracker (one set of daemons per slave)
MASTER DAEMONS
NAME NODE
Handles storage meta-data
Keeps metadata in memory and also on disk
Secondary Name Node
Performs 'checkpointing' ONLY for NameNode
Not a warm, hot, or any sort of failover at all
Job Tracker
Coordinates processing of stored data
Also coordinates scheduling of job processing
Typical cluster has one NN, 2NN and JT
SLAVE DAEMONS
Data Node - handles raw data reads and writes
Task Tracker - handles individual task assignments
Slave nodes report 'heartbeats' to the master daemons,
i.e. DNs report to the NN and TTs report to the JT
- I am alive (DN and TT)
- here's what I am storing (DN)
- I can take tasks (TT)
- I am working on tasks X, Y (TT)
- I finished tasks X, Y, Z (TT)
Hadoop can run in
1. Local (standalone) mode - everything in a single JVM, no real daemons
2. Pseudo-distributed mode - all 5 daemons as separate JVMs on one machine (NN, 2NN, JT, DN, TT)
3. Fully distributed mode
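A configuration sketch for mode 2 (pseudo-distributed, all five daemons on one machine). Property names assume Hadoop 1.x, which matches the JobTracker/TaskTracker architecture in these notes; the ports are the conventional defaults:

```xml
<!-- core-site.xml: point HDFS clients at the local NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single machine, so replication factor 1 -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: local JobTracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>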
HDFS
Distributed File System
Default block size is 64 MB; recommended to change it to 128 MB
Files are split into blocks and distributed across the slaves
Each block is replicated according to the configured replication factor (RF, 3x by default) for durability and concurrent access
WORM - Write Once, Read Many
Files cannot be modified in place; if a change is needed, the file must be deleted and reloaded
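The block-splitting and replica-placement idea can be sketched in a few lines. This is a toy model, not real HDFS code; the round-robin placement is a simplification of HDFS's actual rack-aware policy:

```python
# Toy model: how a file is cut into fixed-size blocks and each block is
# placed on RF distinct DataNodes. Not real HDFS code.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the recommended block size
RF = 3                          # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes becomes."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

def place_replicas(num_blocks, datanodes, rf=RF):
    """Simplified round-robin placement: each block goes to rf distinct DNs."""
    return {b: [datanodes[(b + i) % len(datanodes)] for i in range(rf)]
            for b in range(num_blocks)}

blocks = split_into_blocks(300 * 1024 * 1024)            # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks))                                       # -> 3 (128 + 128 + 44 MB)
```

Note the last block is smaller than the block size; HDFS does not pad it out.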
NAME NODE - metadata on disk
NN persists metadata into 2 files
1. fsimage ( point in time snapshot)
2. edit logs (changes made to the snapshot)
fsimage and edit logs are periodically merged.
ADMINS- Always back up fsimage and edit logs
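The fsimage/edit-log relationship can be modelled as a snapshot plus a replayable log. A minimal sketch, assuming a namespace modelled as a plain dict and only create/delete operations (real edit logs carry many more op types):

```python
# Sketch: fsimage as a point-in-time snapshot of the namespace, the edit log
# as the operations applied since, and checkpointing as replaying the log.
def apply_edit(namespace, edit):
    op, path = edit
    if op == "create":
        namespace[path] = {}        # metadata entry only; no file data here
    elif op == "delete":
        namespace.pop(path, None)
    return namespace

def checkpoint(fsimage, edit_log):
    """Merge the edit log into a new fsimage (what the 2NN does periodically)."""
    new_image = dict(fsimage)       # start from the old snapshot
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []            # new snapshot, freshly emptied edit log

fsimage = {"/prod/a.txt": {}}
edits = [("create", "/prod/b.txt"), ("delete", "/prod/a.txt")]
new_image, new_log = checkpoint(fsimage, edits)
print(new_image)                    # -> {'/prod/b.txt': {}}
```

This also shows why a long edit log is bad: without checkpointing, the NN must replay every edit at startup.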
DNs report a heartbeat every 3 sec. If a DN misses 10 heartbeats (30 sec) it is taken out of rotation and its data blocks are re-replicated elsewhere.
Every hour, each DN sends a block report to the NN.
A checksum is generated on every block write; on every read the checksum is validated.
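The dead-node detection above is just a timeout check. A minimal sketch using the intervals from these notes (production HDFS actually uses a much longer timeout before declaring a node dead):

```python
# Sketch: mark a DataNode dead once its last heartbeat is older than
# 10 heartbeat intervals (3 s * 10 = 30 s, per the notes).
HEARTBEAT_INTERVAL = 3   # seconds between heartbeats
MISSED_LIMIT = 10        # missed heartbeats before removal from rotation

def dead_datanodes(last_heartbeat, now):
    """last_heartbeat: {dn_name: timestamp}. Returns DNs past the timeout."""
    timeout = HEARTBEAT_INTERVAL * MISSED_LIMIT
    return sorted(dn for dn, t in last_heartbeat.items() if now - t > timeout)

# dn2 last reported 31 s ago -> taken out of rotation, blocks re-replicated
print(dead_datanodes({"dn1": 100, "dn2": 70}, now=101))   # -> ['dn2']
```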
STEP-BY-STEP WRITE OPERATION
Block write at high level
1. Client cuts the input file into block-sized chunks
2. Client contacts the NN to request a write operation (sends # of blocks and RF)
3. NN responds with a pipeline of DNs to write the first block to
4. Client reaches out to the first DN in the pipeline and performs the write (client only contacts the 1st node in the pipeline)
5. The 2nd, 3rd, and all other DNs write directly to each other
6. After the final DN responds, the client is notified of a successful block write
7. Client notifies the NN of the block write
8. NN responds with a pipeline of DNs to write the second block to
9. Rinse, repeat...
10. Last block written
11. File is closed in the NN
DATA NEVER FLOWS THROUGH NN
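Steps 4-6 above (one block through one pipeline) can be sketched like this. A toy model: real HDFS streams packets and overlaps forwarding, but the shape — client talks only to the first DN, acks come back through the chain — is the same:

```python
# Sketch of the write pipeline for a single block. The client sends the block
# only to the first DataNode; each DN stores its replica and forwards the
# block to the next DN in the pipeline.
def pipeline_write(block, pipeline, storage):
    """pipeline: ordered list of DN names from the NN. Returns the ack chain."""
    for dn in pipeline:
        storage.setdefault(dn, []).append(block)   # dn stores its replica
    # Acks flow back from the last DN through the pipeline to the client.
    return list(reversed(pipeline))

storage = {}
acks = pipeline_write("block-0", ["dn3", "dn1", "dn4"], storage)
print(acks)     # -> ['dn4', 'dn1', 'dn3']: last DN acks first
```

Note the NameNode appears nowhere in the data path — it only handed out the pipeline.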
Secondary Name Node
It's for checkpointing only
On a regular basis (every 1 hr by default)
- Copies the fsimage and edit logs
- Merges the edit logs into a new fsimage
- Writes the fsimage back to the NN
THAT IS ALL IT DOES (no failover). It only does this compacting to keep Hadoop running fast.
The 2NN can be down for a short period of time.
The NN can still compact on its own, but it will take longer since the edit logs will have grown; thus the RPO (Recovery Point Objective) will be longer.
NameNode HA Justification
a. Large Enterprises
b. Internal or External SLAs
c. Multi-day read/write jobs
NameNode HA was introduced in Hadoop 2.0 / CDH4 (by Cloudera)
Note that this is not a Secondary NN but a Standby Name Node
How does NN HA work?
The active NN writes every namespace change to a shared edit log; the Standby NN continuously replays it, so both have an up-to-date picture of HDFS
Journal Nodes:
The fsimage and edit logs on a single NN would be a single point of failure (SPOF).
NN HA resolves that by using a set of Journal Nodes (JNs).
Through the Quorum Journal Manager (QJM) - which runs inside the NN -
- the QJM sends synchronous updates (the edits) to all the Journal Nodes
There should always be an odd number of Journal Nodes - to avoid 'ties' and 'split-brain' situations.
Each JN has its own disk; usually 3 are sufficient.
Split-brain scenario - when both the NN and the standby node think they are active:
--> Each Quorum Journal Manager sends an epoch number to the JNs. When the NN first starts, it registers 1 as the epoch number; from then on, every update carries the epoch number along with the edits.
--> Journal Nodes record the highest epoch number they have ever seen. When the standby NN is activated (manually or automatically), the epoch number is guaranteed to increment. So if the old NN and the Standby both think they are active and write to the Journal Nodes at the same time, the JNs only accept writes carrying the highest epoch number, and the updates from the original NN are dropped.
Note: on every NN failover the epoch number advances.
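The fencing rule in the arrows above fits in a few lines. A sketch, not the real QJM protocol (which also involves quorum acknowledgements and segment recovery):

```python
# Sketch of epoch fencing: a JournalNode accepts a write only if the writer's
# epoch number is at least the highest epoch it has ever seen.
class JournalNode:
    def __init__(self):
        self.highest_epoch = 0
        self.edits = []

    def write(self, epoch, edit):
        if epoch < self.highest_epoch:
            return False             # stale writer (old active NN): dropped
        self.highest_epoch = epoch   # record the highest epoch ever seen
        self.edits.append(edit)
        return True

jn = JournalNode()
jn.write(1, "mkdir /prod")           # original active NN, epoch 1
jn.write(2, "mkdir /dev")            # standby promoted: epoch incremented
accepted = jn.write(1, "rm /prod")   # old NN still thinks it's active
print(accepted)                      # -> False: its update is fenced off
```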
High-availability configuration is a bit complex.
It requires 3-6 additional machines.
DON'T do it unless you have to.
The NN is heavily battle-tested and does not break unless there is human error.
A Secondary NN on separate hardware is sufficient in most cases.
NAME NODE FEDERATION
Breaking up the NN functionality across several NNs. With two separate directories, say 'prod' and 'dev', one NN machine could manage the 'prod' directory and another NN machine the 'dev' directory.
Isolates HDFS directories from each other.
Allows near-infinite scaling of the NN.
Can be used in combination with HA
Almost never used in Prod - Complexity involved is not worth it.
ACCESS CONTROL
Authentication - "are you who you say you are"
Authorisation - "should you get access to what you're asking"
HDFS does authorisation, but out of the box it does not do authentication; authentication can be enabled via Kerberos.
Without Kerberos, HDFS simply trusts the name of the user who runs the command.
HDFS PERMISSIONS: Linux-like permissions (UGO) - user, group, and others
'x' on an HDFS directory is browse permission
'w' on HDFS files is delete permission
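The UGO check itself works like the Linux one: pick the user's bits (owner, group, or other) and test the requested bit against them. A minimal sketch of that selection logic, with made-up entry/user names for illustration:

```python
# Sketch of a UGO permission check. 'entry' is a hypothetical metadata record;
# 'want' is 'r', 'w', or 'x' (where, per the notes, 'x' on a directory means
# permission to browse it).
def allowed(entry, user, groups, want):
    """entry: {'owner', 'group', 'perms': (user_bits, group_bits, other_bits)}."""
    u, g, o = entry["perms"]
    if user == entry["owner"]:
        bits = u                     # owner's bits take precedence
    elif entry["group"] in groups:
        bits = g                     # then the group's bits
    else:
        bits = o                     # everyone else gets the 'other' bits
    return want in bits

# a directory owned by alice, group analysts, mode rwxr-x r-- (i.e. 754)
d = {"owner": "alice", "group": "analysts", "perms": ("rwx", "r-x", "r--")}
print(allowed(d, "bob", ["analysts"], "x"))   # -> True: group may browse it
print(allowed(d, "bob", ["analysts"], "w"))   # -> False: group may not write
```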
Kerberos
- Basically a ticketing system