Hadoop Cluster

Post date: Feb 09, 2019 1:15:22 PM

When talking about Hadoop clusters, two main terms come up: cluster and node. A cluster is a group of machines that work together as a single system, and a node is an individual machine within that cluster.

Hadoop clusters have two types of machines, Master and Slave, where:

Master

Master nodes host the coordinating daemons of the cluster, the HDFS NameNode and the YARN ResourceManager.

Slaves

Slave nodes host the worker daemons, the HDFS DataNode and the YARN NodeManager, and store and process the actual data.

Notes: Hortonworks recommends separating master and slave nodes because:

• Task/application workloads on the slave nodes should be isolated from the masters.

• Slave nodes are frequently decommissioned for maintenance.

Node Configuration

To configure the Hadoop cluster, both the environment in which the Hadoop daemons execute and the configuration parameters for the Hadoop daemons need to be configured. The HDFS daemons are the NameNode and the DataNode; the YARN daemons are the ResourceManager and the NodeManager. Hadoop's Java configuration is driven by two types of important configuration files:

• Read-only default configuration (core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml)

• Site-specific configuration (core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml)

Projecting Required Big Data Capacity

We start with 1 TB of daily data in Year 1 and assume 15% data growth per quarter. Assuming a 15% year-on-year growth in data volumes and 1,080 TB of data in Year 1, the capacity may grow to 8,295 TB of data by the end of Year 5. If we were instead to assume a 30% year-on-year growth in data volumes and 1,080 TB of data in Year 1, the capacity might grow to 50,598 TB of data by the end of Year 5.
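
As a rough illustration only, here is a minimal Python sketch of such a projection. It assumes a simple model in which the yearly volume grows by a fixed year-on-year rate and the yearly volumes are accumulated over five years; the exact growth model behind the figures quoted above may differ, and the function name is illustrative.

# Simplified capacity projection: the yearly data volume grows by a fixed
# year-on-year rate, and total capacity is the sum of the yearly volumes.
def projected_capacity_tb(year1_tb, yoy_growth, years=5):
    total = 0.0
    yearly = float(year1_tb)
    for _ in range(years):
        total += yearly
        yearly *= 1 + yoy_growth
    return total

# Example with assumed values: 1,080 TB in Year 1, 15% and 30% year-on-year growth.
print(projected_capacity_tb(1080, 0.15))
print(projected_capacity_tb(1080, 0.30))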

The following formula can be used to estimate Hadoop storage and arrive at the required number of data nodes:

Hadoop Storage (H) = C * R * S / (1 - i)

Legend

C = Compression ratio (1 when no compression is applied)
R = Replication factor
S = Size of the data to be moved to Hadoop (initial data plus projected growth)
i = Intermediate factor (working space for intermediate/temporary data, assumed here to be 0.25)
H = Required Hadoop storage

Estimating Required Hadoop Storage and Number of Data Nodes

With no compression, C equals 1. The replication factor is assumed to be 3 and the intermediate factor 0.25 or ¼. The calculation for H in this case becomes:

H = 1*3*S/(1-(1/4)) = 3*S/(3/4) = 4*S
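
A minimal Python sketch of this estimate, plugging in the assumptions used above (no compression, replication factor 3, intermediate factor 1/4); the function name and default values are illustrative.

# Hadoop storage estimate: H = C * R * S / (1 - i)
def hadoop_storage(data_size, compression=1.0, replication=3, intermediate=0.25):
    return compression * replication * data_size / (1 - intermediate)

# With C = 1, R = 3 and i = 1/4, the required storage is 4 times the raw data size.
print(hadoop_storage(1))  # 4.0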

 

The required Hadoop storage in this instance is therefore estimated to be four times the initial data size. The following formula can be used to estimate the number of data nodes:

Number of data nodes (n) = H/D = C*R*S/((1-i)*D)

where D is the disk space available per node. Let us assume that 8 TB is the available disk space per node, each node comprising 10 disks of 1 TB capacity each, minus 2 disks set aside for the operating system. Also, assuming the initial data size to be 600 TB:

N = 600/8 = 75

Thus, 75 data nodes are needed in this case.
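
A small Python sketch of the worked example above; the disk layout and data size are the assumptions stated in the text, and rounding up with math.ceil is an added assumption for cases where the division is not exact.

import math

# Usable disk per node: 10 disks of 1 TB each, minus 2 disks reserved for the OS.
usable_disk_tb = (10 - 2) * 1  # 8 TB

initial_data_tb = 600
data_nodes = math.ceil(initial_data_tb / usable_disk_tb)
print(data_nodes)  # 75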

Notes: If complex processing is anticipated, it is recommended to keep at least 10% additional vacant space to accommodate such processing. This 10% is in addition to the 20% set aside for OS installation and operation.

The memory needed for each node can be calculated as follows:

Total memory needed = [(memory per CPU core) * (number of CPU cores)] + data node process memory + data node task tracker memory + OS memory

Each data node will comprise a number of data blocks on the cluster.
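
The following Python sketch applies the memory formula above; all numeric values in the example call (4 GB per core, 16 cores, and the DataNode, TaskTracker and OS allowances) are illustrative assumptions, not figures from the text.

# Total memory per data node = (memory per CPU core * number of CPU cores)
# + DataNode process memory + TaskTracker process memory + OS memory
def node_memory_gb(mem_per_core_gb, cores, datanode_gb=4, tasktracker_gb=4, os_gb=8):
    return mem_per_core_gb * cores + datanode_gb + tasktracker_gb + os_gb

print(node_memory_gb(mem_per_core_gb=4, cores=16))  # 80 GB with these assumed values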

As a rule of thumb, an increase in the number of data nodes should be supported by a corresponding increase in RAM as well.

