
How to install Hadoop 2.7.1 multi-node cluster on Amazon (AWS) EC2 instance [IMPROVED] PART 1

Posted on December 17, 2015 | Categories: Installation

I’ve been receiving comments about troubles installing Hadoop based on one of my earlier posts on this blog: How to install Hadoop 2.7.1 multi-node cluster on Amazon (AWS) EC2 instance. I decided to re-post an improved version of the tutorial. I have just finished setting up the cluster with similar settings on Amazon EC2 instances, and I completed it without any troubles or errors in 20 minutes or so (depending on your Internet connection, since Hadoop has to be downloaded). This tutorial is continued in Part 2.

    1. Launch an Amazon EC2 instance. In this tutorial, I did all the installation on one single instance (the master). Then, I cloned the master instance (with Hadoop already set up) and spawned two other instances as slaves. The technical setup of the instance is as follows:

    • Ubuntu Server 14.04 LTS (PV)

    • t1.micro – free tier

    • 8 GB General Purpose SSD

    • Defined a custom security rule to ALLOW ALL TRAFFIC from ANYWHERE (Disclaimer: this security setup is just for testing purposes; it is in no way encouraged in a live environment. A tighter alternative is sketched right after this list).
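If you want a tighter rule instead, you could open only SSH (port 22) from your own address range to begin with and add the Hadoop ports later. A minimal sketch with the AWS CLI; the security group ID and CIDR below are placeholders you would replace with your own:

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 203.0.113.0/24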

    2. When you create the first instance, you will need to download the key pair to log in. It is a file with a .pem extension. When the instance is launched and ready, you can log in from the CLI using the following command:

ssh -i path_to_.pem_file ubuntu@public_IP_of_instance

Eg:

ssh -i ~/Desktop/mfaizmzaki.pem ubuntu@55.55.55.55

    3. Create a ‘hadoop’ group, and add a user, ‘hduser’, to the group:

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

It will prompt you to create a password for ‘hduser’. Proceed with creating the password that you wish. Then, add the newly created ‘hduser’ to the ‘sudo’ group to allow root access with the following command:

sudo adduser hduser sudo

    4. We do not want to have to use the .pem file every time we log in, so edit the SSH configuration file to allow logging in with the password that we have just created for ‘hduser’:

sudo nano /etc/ssh/sshd_config

Find the following line and edit it accordingly. You can use Ctrl+W and type PasswordAuthentication to quickly find the line.

PasswordAuthentication yes

Tap Ctrl+O to save the edit and Ctrl+X to exit the editor. Then restart the SSH service for the change to take effect.

sudo service ssh restart
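If you prefer a non-interactive edit, a sed one-liner like the following should also work (this assumes the file currently contains a line reading ‘PasswordAuthentication no’; double-check the file afterwards):

sudo sed -i 's/^PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config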

    5. Now, switch to the hduser account by issuing the command:

sudo su - hduser

*If you want to be a little bit adventurous, you can try logging in from the command line with the following command:

ssh hduser@publicIP_of_instance

Eg:

ssh hduser@55.55.55.55

You should be able to log in with the password you have created earlier.

    6. Once in the hduser account, you are ready to start the Hadoop installation. First, we update our Ubuntu instance.

sudo apt-get update

    7. Then, install the Java JDK. For this tutorial, I am using Java JDK 7.

sudo apt-get install openjdk-7-jdk

Once the installation is complete, we want to create a symlink to ease our job later. Change into the following directory and issue the command:

cd /usr/lib/jvm
sudo ln -s java-7-openjdk-amd64/ jdk
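As a quick sanity check (optional), you can confirm that Java is installed and that the symlink points to the right place:

java -version
ls -l /usr/lib/jvm/jdk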

    8. Next, set up the SSH key.

ssh-keygen -t rsa -P ''
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

To get Ubuntu to register the key as a known host, try SSHing into localhost:

ssh localhost

You should get a prompt to save the key as a known host. Once done, you can exit the SSH connection by issuing

exit

    9. Now that we have successfully set up the SSH key, we want to add our IP address-hostname mapping to the /etc/hosts file. This step is rather optional, as we could use our private IP addresses directly to connect between the nodes instead of hostnames. But since IP addresses are a bit of a mess to look at, hostnames are preferable.

sudo nano /etc/hosts

Add an entry in the form:

PRIVATE_IP_address hostname

Eg:

192.168.1.1 master

    10. Remember in Step 8 when we SSHed into localhost to add the key to the known hosts? We will need to do the same for the ‘master’ entry that we have added to the /etc/hosts file as well. So issue the command:

ssh master

Accept the prompt. Once successful, you can exit the connection by typing in ‘exit’.

    11. Next, disable IPv6 to avoid further complications. If you are curious, you can explore running Hadoop over IPv6; I personally have not.

sudo nano /etc/sysctl.conf

Paste the following at the end of the file and save the edit.

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
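These settings take effect on the next reboot. To apply them immediately (optional), you can reload sysctl and verify that IPv6 is reported as disabled; the second command should print 1:

sudo sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6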

    12. Download the latest release of Hadoop (version 2.7.1).

wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz

Then, we want to extract it into the /usr/local directory. To do that:

sudo tar vxzf hadoop-2.7.1.tar.gz -C /usr/local

Change to that directory and rename the extracted folder for easier reference:

cd /usr/local
sudo mv hadoop-2.7.1 hadoop

Change the ownership of the folder accordingly:

sudo chown -R hduser:hadoop hadoop

    13. We have successfully downloaded and extracted Hadoop. Now we want to set up all the environment variables and the Hadoop configuration. Change back to the home directory by issuing

cd

We can now set up Hadoop environment variables in the .bashrc file

sudo nano .bashrc

Paste the following at the end of the file and save the edit

#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/jdk/
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
#End

Reload .bashrc by the following command:

source ~/.bashrc
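As a quick check (optional), confirm that the variables are set and that the Hadoop binaries are now on your PATH:

echo $HADOOP_INSTALL
which hadoop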

    14. We move on to configuring Hadoop. It is worth noting that all the Hadoop configuration files are located in the $HADOOP_INSTALL/etc/hadoop folder.

First, edit hadoop-env.sh

cd $HADOOP_INSTALL/etc/hadoop
sudo nano hadoop-env.sh

Modify the following line

export JAVA_HOME=${JAVA_HOME}

to the following:

export JAVA_HOME=/usr/lib/jvm/jdk/

Save the edit and exit the editor.

    15. Then, check the Hadoop installation.

hadoop version

It should print version information similar to the following (build details such as revision, date, and checksum will differ for your download):

Hadoop 2.7.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.1.jar

    16. Okay, now edit core-site.xml. Issue the command below:

sudo nano core-site.xml

Paste the following between the configuration tags. Save and exit the editor

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>
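For reference, the whole core-site.xml should end up looking roughly like this after the edit (the header lines at the top of your file may differ slightly):

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>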

    17. Next is yarn-site.xml.

sudo nano yarn-site.xml

Copy the following and paste it between the configuration tags just like before. Save and exit the editor.

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>master:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>master:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>master:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>master:8033</value>
</property>

    18. To edit the mapred-site.xml, we first need to rename the template:

sudo mv mapred-site.xml.template mapred-site.xml
sudo nano mapred-site.xml

Copy the following and paste it between the configuration tags just like before. Save and exit the editor.

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

    19. The final file that we have to edit is hdfs-site.xml.

sudo nano hdfs-site.xml

Copy the following and paste it between the configuration tags just like before. Save and exit the editor.

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default is 3. It is the number of replicas kept for each block across HDFS.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hduser/mydata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hduser/mydata/hdfs/datanode</value>
</property>
<property>
  <name>dfs.http.address</name>
  <value>master:50070</value>
  <description>Enter your primary NameNode hostname for HTTP access.</description>
</property>
<property>
  <name>dfs.secondary.http.address</name>
  <value>slave:50090</value>
  <description>Optional. Only if you want your Secondary NameNode to run on a specific node.</description>
</property>

    20. Now that we are done with the configuration files, we want to add the hostnames of all the DataNodes to the slaves file. Since our Master node (the node we are setting up) will also serve as a DataNode, we insert its hostname into this file too. For the time being, we do not have any DataNodes other than this Master node; we will add the other slave nodes once we have cloned this instance.

sudo nano slaves

Remove the line showing ‘localhost’. Add the following and save the edit:

master

    21. We are almost there. Now return to the home folder:

cd

Create the HDFS folders and give appropriate ownership:

sudo mkdir -p mydata/hdfs/namenode mydata/hdfs/datanode
sudo chown -R hduser:hadoop mydata

    22. We will now format our NameNode (this Master node also acts as the NameNode):

hdfs namenode -format
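If the format succeeds, the NameNode metadata directory should have been populated. A quick optional check:

ls ~/mydata/hdfs/namenode/current

You should see a VERSION file and an initial fsimage in there.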

Alright, we are done! But we have no slave nodes yet. We will clone this instance to launch our slave nodes, which I will explain in the following post: Part 2. Once we are done with that, we will launch our Hadoop cluster.

So, we successfully set up Hadoop on the Master node in the previous post: Part 1.

To make things easier, we will now just clone the instance! To do that:

    1. Select your Master node on the EC2 dashboard. Then, click on the Actions tab. Click Image > Create Image.

    2. Now look at the left pane and click AMIs to view your created image. It should be in ‘pending’ status. Wait for it to become available.

    3. Once it is available, select the AMI and click Launch. You will go through the same steps as when you first created the Master node, but this time, in the security group and key pair windows, select the existing group and key pair that you created for the Master node. You can launch as many instances as you want.

    4. Once the instance(s) are ready and available, you can log in just like you did with the Master node, as it is literally the same (cloned) instance.

ssh hduser@publicIP_of_instance

Eg:

ssh hduser@66.66.66.66

You should log in using the SAME password as the ‘hduser’ on Master node (as this is the SAME/CLONED instance).

    5. Now, add the IP address-hostname mappings to the /etc/hosts file like you did in Step 9 of Part 1.

sudo nano /etc/hosts

Add the PRIVATE IPs of the slaves and the master and the corresponding hostnames:

192.168.1.1 master
192.168.1.2 slave

The IPs entered above are just examples; you should edit them accordingly. Save and exit the editor.

**EDIT THE /etc/hosts FILE ON THE MASTER NODE AS WELL. IT SHOULD LOOK THE SAME.**

    6. On your MASTER NODE, try to SSH into the slave like you did in Steps 8 and 10 of Part 1 to add the slave’s key to the known hosts file.

ssh slave

Accept the prompt and, once successful, exit the connection by issuing:

exit

    7. One last thing to do before launching the cluster is to declare all the slave nodes inside the slaves file. Revisit Step 20 in Part 1. You will need to add all the slave nodes’ hostnames to the slaves file ON ALL NODES.

cd $HADOOP_INSTALL/etc/hadoop
sudo nano slaves

An example of the contents of the file on all nodes would be the following (here with one slave node; add a line for each additional slave):

master
slave

Save and exit the editor.

    8. You are now ready to launch your Hadoop Cluster. On your MASTER NODE, issue the following commands to launch the cluster:

start-dfs.sh

and

start-yarn.sh

Alternatively, you can use the following, but it will print a deprecation warning.

start-all.sh
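Once the daemons are up, you can ask HDFS for a cluster report to confirm that the DataNodes have registered; the number of live datanodes should match the number of entries in your slaves file:

hdfs dfsadmin -report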

    9. Also on your MASTER NODE, start the History Server:

cd $HADOOP_INSTALL/sbin
mr-jobhistory-daemon.sh start historyserver

    10. You can now check the status of your Hadoop daemons. Issue the following command on your MASTER NODE:

jps

It should display the following:

NameNode
DataNode
ResourceManager
NodeManager
JobHistoryServer
Jps

Issue the same jps command on a SLAVE NODE; it should display the following:

DataNode
NodeManager
SecondaryNameNode
Jps

    11. The Hadoop web UIs are reachable at the following addresses:

    • http://PUBLIC_IP_OF_MASTER_NODE:8088 – Resource Manager

    • http://PUBLIC_IP_OF_NAME_NODE:50070 – NameNode

    • http://PUBLIC_IP_OF_DATA_NODE:50075 – DataNode

    • http://PUBLIC_IP_OF_MASTER_NODE:19888 – History Server

    12. If you want to add extra nodes while the cluster is live and running, just spawn a new instance like you did earlier by repeating Steps 2 – 7 above. Then, on the NEW NODE, start the DataNode and NodeManager daemons by issuing:

cd $HADOOP_INSTALL/sbin
hadoop-daemon.sh start datanode

and

yarn-daemon.sh start nodemanager
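After starting the daemons on the new node, you can confirm from the MASTER NODE that it has joined the cluster, for example with:

hdfs dfsadmin -report
yarn node -list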