How to install Hadoop 2.7.1 multi-node cluster on Amazon (AWS) EC2 instance [IMPROVED] PART 1
Posted on December 17, 2015 | Categories: Installation
I’ve been receiving comments about troubles installing Hadoop based on one of my posts on this blog: How to install Hadoop 2.7.1 multi-node cluster on Amazon (AWS) EC2 instance. I decided to re-post an improved version of the tutorial. I have just finished setting up the cluster with similar settings on Amazon EC2 instances, and completed it without any troubles or errors in 20 minutes or so, depending on your Internet connection (Hadoop has to be downloaded). This tutorial will be continued in Part 2.
Launch an Amazon EC2 instance. In this tutorial, I did all the installation on one single instance (the master). Then, I cloned the master instance (with Hadoop already set up) and spawned two other instances as slaves. The technical setup of the instance is as follows:
Ubuntu Server 14.04 LTS (PV)
t1.micro – free tier
8 GB general-purpose SSD
Defined a custom security rule to ALLOW ALL TRAFFIC from ANYWHERE* (Disclaimer: this security setup is just for testing purposes. This approach is in no way encouraged in a live environment).
When you create the first instance, you will need to download the keypair to log in. It is a file with a .pem extension. When the instance is launched and ready, you can log in from the CLI using the following command:
ssh -i path_to_.pem_file ubuntu@public_IP_of_instance
Eg:
ssh -i ~/Desktop/mfaizmzaki.pem ubuntu@55.55.55.55
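If SSH refuses the key with a "WARNING: UNPROTECTED PRIVATE KEY FILE" message, the downloaded .pem file is readable by other users. A minimal sketch to fix that (the path is just the example from above; adjust it to wherever you saved your keypair):

```shell
# ssh refuses private keys that are readable by group/others,
# so restrict the .pem file to owner read-only.
KEY=~/Desktop/mfaizmzaki.pem   # example path; use your own keypair file
if [ -f "$KEY" ]; then
  chmod 400 "$KEY"
fi
```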
Create a ‘hadoop’ group, and add a user, ‘hduser’, to the group:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
It will prompt you to create a password for ‘hduser’. Proceed with creating a password of your choice. Then, add the newly created ‘hduser’ to the ‘sudo’ group to allow root access with the following command:
sudo adduser hduser sudo
We do not want to use the .pem file every time we log in, so edit the SSH configuration file to allow logging in with the password we just created for ‘hduser’:
sudo nano /etc/ssh/sshd_config
Find the following line and edit it accordingly. You can press Ctrl+W and type PasswordAuthentication to quickly find it.
PasswordAuthentication yes
Press Ctrl+O to save the edit and Ctrl+X to exit the editor. Then restart the SSH service for the change to take effect.
sudo service ssh restart
Now, switch to the hduser account by issuing the command:
sudo su - hduser
*If you want to be a little bit adventurous, you can try logging in from the command line with the following command:
ssh hduser@publicIP_of_instance
Eg:
ssh hduser@55.55.55.55
You should be able to log in with the password you have created earlier.
Once in the hduser account, you are ready to start the Hadoop installation. First, we update our Ubuntu instance.
sudo apt-get update
Then, install the Java JDK. For this tutorial, I am using Java JDK 7.
sudo apt-get install openjdk-7-jdk
Once the installation is complete, we want to create a symlink to ease our job later. Change into the following directory and issue the commands:
cd /usr/lib/jvm
sudo ln -s java-7-openjdk-amd64/ jdk
Next, set up the SSH key.
ssh-keygen -t rsa -P ''
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
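Passwordless SSH will silently fall back to a password prompt if the permissions on ~/.ssh are too loose. A small helper sketch to tighten them (the function name fix_ssh_perms is mine, not part of any tool):

```shell
# SSH ignores authorized_keys unless ~/.ssh is mode 700 and the
# key files are mode 600; tighten both if they exist.
fix_ssh_perms() {
  dir="$1"
  if [ ! -d "$dir" ]; then return 0; fi
  chmod 700 "$dir"
  if [ -f "$dir/authorized_keys" ]; then
    chmod 600 "$dir/authorized_keys"
  fi
}
fix_ssh_perms "$HOME/.ssh"
```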
To get Ubuntu to register the key as a known host, try SSHing into localhost:
ssh localhost
You should get a prompt to save the key as a known host. Once done, you can exit the SSH connection by issuing:
exit
Now that we have successfully set up the SSH key, we want to add our IP address-hostname mapping to the /etc/hosts file. This step is somewhat optional, since we could connect between the nodes using private IP addresses directly instead of hostnames. But since IP addresses are a bit of a mess to look at, hostnames are preferable.
sudo nano /etc/hosts
Add a line of the form:
PRIVATE_IP_address hostname
Eg:
192.168.1.1 master
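If you prefer to script this instead of editing with nano, you could append the mapping only when it is not already there, so re-running it never creates duplicates. A sketch (the helper name add_host_entry is mine; run it as root when targeting /etc/hosts):

```shell
# Append "IP hostname" to a hosts file only if the hostname is
# not already present (idempotent, safe to re-run).
add_host_entry() {
  ip="$1"; host="$2"; file="$3"
  grep -qw "$host" "$file" || echo "$ip $host" >> "$file"
}
# e.g. (in a root shell):  add_host_entry 192.168.1.1 master /etc/hosts
```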
Remember in Step 8 when we SSHed into localhost to add the key to the known hosts? We will need to do the same for the ‘master’ entry that we added to the /etc/hosts file. Issue the command:
ssh master
Accept the prompt. Once successful, you can exit the connection by typing in ‘exit’.
Next, disable IPv6 to avoid further complications. (If you are curious, you can explore running Hadoop over IPv6; I personally have not.)
sudo nano /etc/sysctl.conf
Paste the following at the end of the file and save the edit.
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Download the latest release of Hadoop (version 2.7.1).
wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
Then, we want to extract it into the /usr/local directory. To do that:
sudo tar vxzf hadoop-2.7.1.tar.gz -C /usr/local
Change to that directory and rename the extracted folder for easier reference:
cd /usr/local
sudo mv hadoop-2.7.1 hadoop
Change the ownership of the folder accordingly:
sudo chown -R hduser:hadoop hadoop
We have successfully downloaded and extracted Hadoop. Now we want to set up the environment variables and the Hadoop configuration. Change back to the home directory by issuing:
cd
We can now set up Hadoop environment variables in the .bashrc file
sudo nano .bashrc
Paste the following at the end of the file and save the edit
#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/jdk/
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
#End
Reload .bashrc by the following command:
source ~/.bashrc
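After reloading .bashrc, you can quickly sanity-check that the variables are set and point at real directories before going further. A rough sketch (the helper check_dir_var is mine; it just inspects the environment):

```shell
# Print OK if the variable is set and points at an existing
# directory; otherwise flag it so you can fix .bashrc.
check_dir_var() {
  name="$1"; value="$2"
  if [ -n "$value" ] && [ -d "$value" ]; then
    echo "OK: $name=$value"
  else
    echo "CHECK: $name is unset or not a directory"
  fi
}
check_dir_var JAVA_HOME "$JAVA_HOME"
check_dir_var HADOOP_INSTALL "$HADOOP_INSTALL"
```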
We move on to configuring Hadoop. It is worth noting that all Hadoop configuration files are located in the $HADOOP_INSTALL/etc/hadoop folder.
First, edit hadoop-env.sh
cd $HADOOP_INSTALL/etc/hadoop
sudo nano hadoop-env.sh
Modify the following line
export JAVA_HOME=${JAVA_HOME}
to the following:
export JAVA_HOME=/usr/lib/jvm/jdk/
Save the edit and exit the editor.
Then, check Hadoop installation.
hadoop version
It should print output similar to the following (the build details vary by release):
Hadoop 2.7.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.1.jar
Okay, now edit core-site.xml. Issue the command below:
sudo nano core-site.xml
Paste the following between the configuration tags. Save and exit the editor. (Note: fs.default.name is deprecated in favor of fs.defaultFS, but both still work in 2.7.1.)
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>
Next is yarn-site.xml.
sudo nano yarn-site.xml
Copy the following and paste it between the configuration tags just like before. Save and exit the editor.
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>master:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>master:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>master:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>master:8033</value>
</property>
To edit the mapred-site.xml, we first need to rename the template:
sudo mv mapred-site.xml.template mapred-site.xml
sudo nano mapred-site.xml
Copy the following and paste it between the configuration tags just like before. Save and exit the editor.
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
The final file that we have to edit is hdfs-site.xml
sudo nano hdfs-site.xml
Copy the following and paste it between the configuration tags just like before. Save and exit the editor.
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default is 3. It is the number of times each block is replicated across HDFS.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/hduser/mydata/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/hduser/mydata/hdfs/datanode</value>
</property>
<property>
  <name>dfs.http.address</name>
  <value>master:50070</value>
  <description>Enter your primary NameNode hostname for HTTP access.</description>
</property>
<property>
  <name>dfs.secondary.http.address</name>
  <value>slave:50090</value>
  <description>Optional. Only set this if you want your secondary NameNode on a specific node.</description>
</property>
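A typo in any of these XML files will make the daemons fail at startup with a parse error, so it is worth confirming each file is well-formed before moving on. A sketch, assuming python3 is available on the instance (Ubuntu 14.04 ships with it) and that you run this from $HADOOP_INSTALL/etc/hadoop:

```shell
# Use Python's stdlib XML parser to confirm each edited config
# file is well-formed; reports any file that fails to parse.
for f in core-site.xml yarn-site.xml mapred-site.xml hdfs-site.xml; do
  if python3 -c "import sys, xml.etree.ElementTree as ET; ET.parse(sys.argv[1])" "$f" 2>/dev/null; then
    echo "well-formed: $f"
  else
    echo "BROKEN: $f"
  fi
done
```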
Now that we are done, we want to add the hostnames of all the DataNodes to the slaves file. Since our Master node (the node we are setting up) will also serve as a DataNode, we insert its hostname in this file too. For the time being, we do not have any DataNodes other than this Master node. We will add the other slave nodes once we have cloned this instance.
sudo nano slaves
Remove the line showing ‘localhost‘. Add the following and save the edit:
master
We are almost there. Now return to the home folder:
cd
Create the HDFS folders and give appropriate ownership:
sudo mkdir -p mydata/hdfs/namenode mydata/hdfs/datanode
sudo chown -R hduser:hadoop mydata
We will now format our NameNode (this Master node also acts as the NameNode):
hdfs namenode -format
Alright, we are done! But we have no slave nodes yet. We will clone this instance to launch our slave nodes. I will explain this in the following post, Part 2. Once that is done, we will launch our Hadoop cluster.
So we have successfully set up Hadoop on the Master node in the previous post, Part 1.
To make things easier, we will now just clone the instance! To do that:
Select your Master node on the EC2 dashboard. Then, click on the Actions tab. Click Image > Create Image.
Now look on the left pane, click AMIs to view your created Image. It should be in pending status. Wait for it to be available.
Once available, select the AMI and click Launch. You will go through the same steps as when you first created the Master node. But this time, on the security groups and keypair windows, select the existing group and keypair that you created for the Master node. You can launch as many instances as you want.
Once the instance(s) are ready and available, you can log in just like you did with the Master node as it is literally the same (cloned) instance.
ssh hduser@publicIP_of_instance
Eg:
ssh hduser@66.66.66.66
You should log in using the SAME password as the ‘hduser’ on Master node (as this is the SAME/CLONED instance).
Now, add the IP address-hostname inside the /etc/hosts file like you did in Step 9 of Part 1.
sudo nano /etc/hosts
Add the PRIVATE IP of the slaves and master and the corresponding hostnames
192.168.1.1 master
192.168.1.2 slave
The IPs entered above are just EXAMPLES; edit them accordingly. Save and exit the editor.
**EDIT THE /etc/hosts FILE ON THE MASTER NODE AS WELL. IT SHOULD LOOK THE SAME**
On your MASTER NODE, try to SSH into the slave like you did in Steps 8 and 10 of Part 1 to add the slave’s key to the known hosts file.
ssh slave
Accept the prompt and, once successful, exit the connection by issuing:
exit
One last thing to do before launching the cluster is to declare all the slave nodes inside the slaves file. Revisit Step 20 in Part 1. You will need to add all slave nodes’ hostnames to the slaves file ON ALL NODES.
cd $HADOOP_INSTALL/etc/hadoop
sudo nano slaves
An example of the contents of the file on all nodes (with the master also acting as a DataNode, plus one slave node) would be:
master
slave
Save and exit the editor.
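Before launching the daemons, you can cross-check that every hostname listed in the slaves file also has a mapping in /etc/hosts; a missing mapping is a common cause of startup failures. A rough sketch (the helper name check_slaves_resolvable is mine; file paths are passed in so you can point it anywhere):

```shell
# For each hostname in the slaves file, verify a mapping exists
# in the hosts file; report any hostname that is missing.
check_slaves_resolvable() {
  slaves_file="$1"; hosts_file="$2"
  missing=0
  while read -r host; do
    if [ -z "$host" ]; then continue; fi
    if ! grep -qw "$host" "$hosts_file"; then
      echo "missing from $hosts_file: $host"
      missing=1
    fi
  done < "$slaves_file"
  return $missing
}
# e.g.:  check_slaves_resolvable $HADOOP_INSTALL/etc/hadoop/slaves /etc/hosts
```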
You are now ready to launch your Hadoop Cluster. On your MASTER NODE, issue the following command to launch the cluster:
start-dfs.sh
and
start-yarn.sh
Alternatively, you can use the following, but it will print a deprecation warning.
start-all.sh
Also on your MASTER NODE, start the History Server:
cd $HADOOP_INSTALL/sbin
mr-jobhistory-daemon.sh start historyserver
You can now check the status of your Hadoop daemons. Issue the following command on your MASTER NODE:
jps
It should display something like the following:
NameNode
DataNode
ResourceManager
NodeManager
JobHistoryServer
Jps
Issue the same jps command on a SLAVE NODE; it should display the following:
DataNode
NodeManager
SecondaryNameNode
Jps
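Rather than eyeballing the jps output on every node, you can check for the expected daemons programmatically. A sketch (the helper check_daemons is mine; the daemon names in the example call match this tutorial's master-node setup):

```shell
# Check captured jps output for each expected daemon name;
# returns nonzero if any expected daemon is not running.
check_daemons() {
  output="$1"; shift
  ok=0
  for daemon in "$@"; do
    if echo "$output" | grep -qw "$daemon"; then
      echo "running: $daemon"
    else
      echo "NOT running: $daemon"
      ok=1
    fi
  done
  return $ok
}
# e.g. on the master:
# check_daemons "$(jps)" NameNode DataNode ResourceManager NodeManager JobHistoryServer
```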
The Hadoop web UIs are reachable at the following addresses:
http://PUBLIC_IP_OF_MASTER_NODE:8088 – Resource Manager
http://PUBLIC_IP_OF_NAME_NODE:50070 – NameNode
http://PUBLIC_IP_OF_DATA_NODE:50075 – DataNode
http://PUBLIC_IP_OF_MASTER_NODE:19888 – History Server
**If you want to add extra nodes while the cluster is live and running, just spawn a new instance like you did earlier by repeating Step 2 – Step 7. Then, on the NEW NODE, start the DataNode and NodeManager daemons by issuing:
cd $HADOOP_INSTALL/sbin
hadoop-daemon.sh start datanode
and
yarn-daemon.sh start nodemanager