Hadoop Cluster
Post date: Feb 09, 2019 1:15:22 PM
When discussing Hadoop clusters, two main terms come up: cluster and node. Defining them:
A cluster is a collection of nodes.
A node is a point of intersection/connection within a network, i.e., a server.
Hadoop clusters contain two types of machines: Master and Slave.
Projecting Required Big Data Capacity
We start with 1 TB of daily data in Year 1 and assume 15% data growth per quarter. Further, assuming a 15% year-on-year growth in data volumes and 1,080 TB of data in Year 1, by the end of Year 5 the capacity may grow to 8,295 TB of data. If we were instead to assume a 30% year-on-year growth in data volumes with the same 1,080 TB of data in Year 1, then by the end of Year 5 the capacity might grow to 50,598 TB of data. The following formula can be used to estimate Hadoop storage and arrive at the required number of data nodes:
Hadoop Storage (H) = C*R*S/(1-i)
Legend
C: Average compression ratio
R: Replication factor
S: Size of data to be moved to Hadoop
i: Intermediate factor
Estimating Required Hadoop Storage and Number of Data Nodes
With no compression, C equals 1. The replication factor is assumed to be 3 and the intermediate factor 0.25 (i.e., 1/4). The calculation for H in this case becomes:
H = 1*3*S / (1 - 1/4) = 3*S / (3/4) = 4*S
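The storage estimate above can be sketched in a few lines of Python (the function name and sample figure are illustrative, not from the text):

```python
def hadoop_storage(c, r, s, i):
    """Estimate required Hadoop storage H = C*R*S / (1 - i).

    c: average compression ratio (1 means no compression)
    r: replication factor
    s: size of data to be moved to Hadoop, in TB
    i: intermediate factor (space reserved for intermediate job output)
    """
    return c * r * s / (1 - i)

# No compression, replication factor 3, intermediate factor 1/4:
# the cluster needs 4x the raw data size.
print(hadoop_storage(1, 3, 600, 0.25))  # 2400.0 (TB), i.e. 4 * 600
```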
Master: HDFS NameNode, YARN ResourceManager
Slaves: HDFS DataNodes, YARN NodeManagers
Notes: Hortonworks recommends separating master and slave nodes because:
• Task/application workloads on the slave nodes should be isolated from the masters.
• Slave nodes are frequently decommissioned for maintenance.
Node Configuration: Hadoop's Java configuration is driven by two types of important configuration files:
Read-only default configuration: core-default.xml, hdfs-default.xml, yarn-default.xml, mapred-default.xml
Site-specific configuration: etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml, etc/hadoop/mapred-site.xml
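For illustration, a minimal etc/hadoop/core-site.xml that overrides a single default from core-default.xml might look as follows (the hostname and port are placeholders, not values from this text):

```xml
<configuration>
  <!-- fs.defaultFS names the default filesystem; "master:9000" is a
       placeholder NameNode host and port. -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```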
The required Hadoop storage in this instance is estimated to be four times the initial data size. The following formula can be used to estimate the number of data nodes (n):

n = H/D = C*R*S/(1-i)/D

D: Disk space available per node

To configure the Hadoop cluster, both the environment in which the Hadoop daemons execute and the configuration parameters for those daemons must be set. The HDFS daemons are the NameNode and the DataNodes; the YARN daemons are the ResourceManager and the NodeManagers.

Let us assume that 8 TB is the available disk space per node, each node comprising 10 disks of 1 TB capacity each, minus 2 disks reserved for the operating system. Also, assuming the initial data size to be 600 TB:

N = 600/8 = 75

(Strictly, n = H/D with H = 4*S = 2,400 TB would give 300 nodes; the example here divides the raw data size by the per-node capacity.)
Notes: If complex processing is anticipated, it is recommended to keep at least 10% additional vacant space to accommodate such processing. This 10% is in addition to the 20% set aside for OS installation and operation.
The memory needed for each node can be calculated as follows:

Total memory needed = [(memory per CPU core) * (number of CPU cores)] + DataNode process memory + DataNode TaskTracker memory + OS memory

Each data node will hold a number of the cluster's data blocks. As a rule of thumb, it should be ensured that an increase in the number of data nodes is supported by a corresponding increase in RAM.
Thus, 75 data nodes are needed in this case.
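The node-count and memory arithmetic above can be sketched as follows. The formula n = H/D can be applied either to the raw data size or to the fully replicated storage H = 4*S, and the memory figures in the usage example are purely illustrative (the text does not give concrete values):

```python
import math

def data_nodes(storage_tb, disk_per_node_tb):
    """Number of data nodes n = H / D, rounded up to whole nodes."""
    return math.ceil(storage_tb / disk_per_node_tb)

def node_memory_gb(mem_per_core_gb, cores, datanode_gb, tasktracker_gb, os_gb):
    """Total memory = (memory per CPU core * number of CPU cores)
    + DataNode process memory + TaskTracker memory + OS memory."""
    return mem_per_core_gb * cores + datanode_gb + tasktracker_gb + os_gb

S = 600  # initial data size (TB)
D = 8    # usable disk per node (TB): 10 x 1 TB disks minus 2 for the OS

print(data_nodes(S, D))      # 75 nodes, dividing raw data by node capacity
print(data_nodes(4 * S, D))  # 300 nodes if replicated storage H = 4*S must fit
print(node_memory_gb(4, 8, 4, 4, 8))  # 48 GB with sample (illustrative) figures
```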