Install Hadoop

Launch AWS EC2 instances (or local Linux instances). Assume three instances with the following roles:

172.31.36.40 (master, runs the Name Node, Job Tracker and Secondary Name Node)

172.31.36.39 (slave)

172.31.36.38 (slave)

1. Make sure SSH has been installed on all instances. Enable password authentication for ease of use:

   sudo vi /etc/ssh/sshd_config

   change "PasswordAuthentication no" to "PasswordAuthentication yes"

2. Make sure Java has been installed on all instances. If not, simply run sudo apt-get install openjdk-6-jre (or another version).

3. Make sure a dedicated user (e.g. hduser) and group (e.g. hadoop) have been created on all instances.
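
   A minimal sketch on Debian/Ubuntu (the hadoop group matches the ownership used in step 6; adjust names to your setup):

    sudo addgroup hadoop

    sudo adduser --ingroup hadoop hduser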

4. Download a Hadoop distribution (e.g. hadoop-1.2.1.tar.gz) from the Apache website to the local computer.
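
   Alternatively, if the instances have direct internet access, the tarball can be downloaded on each instance; the Apache archive URL below is one option (exact mirror/path may vary):

    wget https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz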

5. Upload hadoop-1.2.1.tar.gz from the local computer to all EC2 instances using PuTTY's PSCP.exe:

    PSCP.exe hadoop-1.2.1.tar.gz hduser@ec2-54-68-220-38.us-west-2.compute.amazonaws.com:hadoop-1.2.1.tar.gz

    The xxx.compute.amazonaws.com part is the public DNS of the instance.

   If using a private key to access the instance:

    PSCP.exe -i xxx.ppk hadoop-1.2.1.tar.gz hduser@ec2-54-68-220-38.us-west-2.compute.amazonaws.com:hadoop-1.2.1.tar.gz

   The xxx.ppk file is the private key.

6. Extract the Hadoop install file. Move the extracted folder to /usr/local and set hduser as the owner.

    sudo tar xzf hadoop-1.2.1.tar.gz

    sudo mv hadoop-1.2.1 /usr/local/hadoop

    sudo chown -R hduser:hadoop /usr/local/hadoop

7. Set up /etc/hosts so the instances can resolve each other by hostname. Assume .40 is the master, and .39 and .38 are the slaves.

sudo vi /etc/hosts

127.0.0.1 localhost

172.31.36.40 master

172.31.36.39 slave1

172.31.36.38 slave2
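
A quick check that the mapping works (run on any node; the hostnames are the ones defined above):

 ping -c 1 slave1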

8. Add the Hadoop and Java directories to the PATH variable in .bashrc.

sudo vi /home/hduser/.bashrc

export HADOOP_HOME=/usr/local/hadoop

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin
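
Reload the file and check that the hadoop command is now on the PATH (hadoop version prints the installed release):

 source /home/hduser/.bashrc

 hadoop version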

9. Create a temporary working folder for Hadoop.

sudo mkdir -p /app/hadoop/tmp

sudo chown hduser:hadoop /app/hadoop/tmp

10. Configure JAVA_HOME in hadoop-env.sh.

sudo vi /usr/local/hadoop/conf/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

11. Configure core-site.xml. Set up the temporary folder and the Name Node address for HDFS. Port 54310 is the Name Node port commonly used in these setups. The properties go inside the <configuration> element.

sudo vi /usr/local/hadoop/conf/core-site.xml

<property>

 <name>hadoop.tmp.dir</name>

 <value>/app/hadoop/tmp</value>

</property>

<property>

 <name>fs.default.name</name>

 <value>hdfs://master:54310</value>

</property>
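
For reference, the complete file looks roughly like this (the other *-site.xml files below take the same <configuration> wrapper):

<?xml version="1.0"?>
<configuration>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
 </property>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
 </property>
</configuration>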

12. Configure the Job Tracker in mapred-site.xml. This is where MapReduce jobs are submitted.

sudo vi /usr/local/hadoop/conf/mapred-site.xml

<property>

 <name>mapred.job.tracker</name>

 <value>master:54311</value>

</property>

13. Configure parameters for HDFS in hdfs-site.xml. Assume only two copies of the data are required; the usual replication factor is 3 or more, with 3 being the widely used default (a practice that goes back to the Google File System).

sudo vi /usr/local/hadoop/conf/hdfs-site.xml

<property>

 <name>dfs.replication</name>

 <value>2</value>

</property>

There are other parameters as well (a worked example for this cluster follows the list), e.g.

  mapred.local.dir    temporary folder for intermediate MapReduce data

  mapred.map.tasks    number of map tasks, normally = 10 x number of slave nodes

  mapred.reduce.tasks number of reduce tasks, normally = number of task trackers x reduce slots per task tracker x 0.99
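
  As a worked example for this cluster (all three nodes run a Task Tracker; the 2 reduce slots per Task Tracker assumed below is the Hadoop 1.x default, not something configured above):

   mapred.map.tasks    = 10 x 3 slaves = 30
   mapred.reduce.tasks = 3 task trackers x 2 slots x 0.99 ≈ 5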

14. List the instances that run a Secondary Name Node in /usr/local/hadoop/conf/masters:

master

15. List the slave nodes in /usr/local/hadoop/conf/slaves. Here the master node is used to store data too, so it serves as a slave as well:

master

slave1

slave2

16. Format HDFS from the master node:

 hadoop namenode -format

17. Start HDFS from the master node:

 start-dfs.sh

"starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-namenode-ip-172-31-36-40.out

slave2: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-datanode-ip-172-31-36-38.out

slave1: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-datanode-ip-172-31-36-39.out

master: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-datanode-ip-172-31-36-40.out

master: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-secondarynamenode-ip-172-31-36-40.out"

18. Start MapReduce from the master node:

 start-mapred.sh

"starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-jobtracker-ip-172-31-36-40.out

slave2: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-tasktracker-ip-172-31-36-38.out

slave1: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-tasktracker-ip-172-31-36-39.out

master: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-tasktracker-ip-172-31-36-40.out"
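
To verify that all daemons are up, run jps (shipped with the JDK; install openjdk-6-jdk if it is missing) on each node and ask HDFS for a cluster report. On the master you would expect NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker; on the slaves, DataNode and TaskTracker.

 jps

 hadoop dfsadmin -report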

Note: do the same configuration on the master and all slave instances. Steps 14 to 18 are not required on slave instances; the masters and slaves files are used only by the control scripts running on the Name Node or Job Tracker.

Note2: The Name Node has high memory requirements, as it holds file and block metadata for the entire HDFS.

           The Secondary Name Node periodically merges the Name Node's edit log into a checkpoint of the namespace image. If the Name Node's metadata is corrupted, you can recover from this checkpoint (possibly losing the most recent changes).

           The Job Tracker uses considerable memory and CPU.

           A good practice is to put the Name Node, Secondary Name Node and Job Tracker on different machines.

           The start-dfs.sh script needs to run from the Name Node instance, and the start-mapred.sh script needs to run from the Job Tracker instance.
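
           To shut the cluster down, run the matching stop scripts in reverse order: stop-mapred.sh from the Job Tracker instance, then stop-dfs.sh from the Name Node instance.

            stop-mapred.sh

            stop-dfs.sh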