Install Hadoop
Launch AWS EC2 instances or local Linux instances. Assume three instances with the following roles:
172.31.36.40 (master, runs the Name Node, Job Tracker and Secondary Name Node)
172.31.36.39 (slave)
172.31.36.38 (slave)
1. Make sure SSH is installed on all instances. Enabling password authentication makes the initial setup easier:
sudo vi /etc/ssh/sshd_config
change "PasswordAuthentication no" to "PasswordAuthentication yes", then restart the SSH service (sudo service ssh restart)
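The Hadoop control scripts (start-dfs.sh, start-mapred.sh) log in to every node listed in the masters and slaves files over SSH, so key-based passwordless login from the master to all nodes avoids repeated password prompts. A minimal sketch, run as hduser on the master once the user (step 3) and the /etc/hosts entries (step 7) exist:

```shell
# generate an RSA key pair with an empty passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# copy the public key to each node, including the master itself
ssh-copy-id hduser@master
ssh-copy-id hduser@slave1
ssh-copy-id hduser@slave2

# verify: this should log in without asking for a password
ssh hduser@slave1 hostname
```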
2. Make sure Java has been installed on all instances. If not, run sudo apt-get install openjdk-6-jre (or another version; the JDK package, e.g. openjdk-6-jdk, additionally provides diagnostic tools such as jps)
3. Make sure a dedicated user (e.g. hduser) has been created on all instances.
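If the dedicated user does not exist yet, it can be created together with a hadoop group on each instance; a sketch (on Debian/Ubuntu — the group name matches the chown commands used later, and both names are conventions, not requirements):

```shell
# create a dedicated group and user for running the Hadoop daemons
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
```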
4. Download a Hadoop distribution (e.g. hadoop-1.2.1.tar.gz) from Apache website to local computer.
5. Upload hadoop-1.2.1.tar.gz from the local computer to all EC2 instances, e.g. with PuTTY's PSCP.exe
PSCP.exe hadoop-1.2.1.tar.gz hduser@ec2-54-68-220-38.us-west-2.compute.amazonaws.com:hadoop-1.2.1.tar.gz
where the xxx.compute.amazonaws.com part is the public DNS of the instance.
If using a private key to access the instance:
PSCP.exe -i xxx.ppk hadoop-1.2.1.tar.gz hduser@ec2-54-68-220-38.us-west-2.compute.amazonaws.com:hadoop-1.2.1.tar.gz
where xxx.ppk is the private key file.
6. Extract the Hadoop archive, move the extracted folder to /usr/local, and make hduser the owner.
sudo tar xzf hadoop-1.2.1.tar.gz
sudo mv hadoop-1.2.1 /usr/local/hadoop
sudo chown -R hduser:hadoop /usr/local/hadoop
7. Set up /etc/hosts so the instances can resolve each other's hostnames. Assume .40 is the master; .39 and .38 are the slaves.
sudo vi /etc/hosts
127.0.0.1 localhost
172.31.36.40 master
172.31.36.39 slave1
172.31.36.38 slave2
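After editing /etc/hosts on every instance, it is worth verifying that the names resolve before moving on; a quick check from any node:

```shell
# getent consults /etc/hosts (via NSS); each lookup should print the mapped IP
getent hosts master
getent hosts slave1
getent hosts slave2
```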
8. Add the Hadoop and Java directories to the environment in .bashrc
vi /home/hduser/.bashrc
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin
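Reload the file and confirm the variables took effect before continuing (the hadoop command itself is only available once step 6 is done):

```shell
# re-read .bashrc in the current session
source ~/.bashrc

# both should print the paths set above
echo $HADOOP_HOME
echo $JAVA_HOME

# the hadoop command should now be on the PATH
hadoop version
```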
9. Create a temporary folder for Hadoop.
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
10. Configure JAVA_HOME in hadoop-env.sh
sudo vi /usr/local/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
11. Configure core-site.xml. Set the temporary folder and the Name Node URI for HDFS; port 54310 is a commonly used convention for the Name Node.
sudo vi /usr/local/hadoop/conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
</property>
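Note that in each of these XML files the <property> elements must sit inside the top-level <configuration> element. A complete core-site.xml for this cluster would look like the following (the same wrapper applies to mapred-site.xml and hdfs-site.xml):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
</configuration>
```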
12. Configure the Job Tracker address in mapred-site.xml. This is where Map Reduce jobs are submitted.
sudo vi /usr/local/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
</property>
13. Configure HDFS parameters in hdfs-site.xml. Here we assume two replicas of each block are sufficient; the default, and the common production choice, is 3.
sudo vi /usr/local/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
There are other parameters as well, e.g.
mapred.local.dir temporary folder for intermediate MapReduce data
mapred.map.tasks total map task capacity; a common rule of thumb is 10 x the number of slave nodes
mapred.reduce.tasks number of reduce tasks; a common rule of thumb is # of task trackers x # of reduce slots per task tracker x 0.99
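For the three-node cluster above (three task trackers, since the master also runs one), these rules of thumb give concrete numbers; a quick sketch of the arithmetic, assuming a hypothetical 2 reduce slots per task tracker:

```shell
# 3 nodes run task trackers in this setup (master + 2 slaves)
tasktrackers=3

# mapred.map.tasks ~ 10 x number of slave nodes
map_tasks=$((10 * tasktrackers))
echo "mapred.map.tasks = $map_tasks"        # 30

# mapred.reduce.tasks ~ task trackers x reduce slots per tracker x 0.99
reduce_slots=2
reduce_tasks=$(awk "BEGIN { printf \"%d\", $tasktrackers * $reduce_slots * 0.99 }")
echo "mapred.reduce.tasks = $reduce_tasks"  # 5
```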
14. List the instances that run a Secondary Name Node in /usr/local/hadoop/conf/masters
master
15. List the slave nodes in /usr/local/hadoop/conf/slaves. Here the master node stores data too, so it serves as a slave as well.
master
slave1
slave2
16. Format HDFS from the master node. This initializes (and erases) the Name Node metadata, so run it only once, on a new cluster.
hadoop namenode -format
17. Start HDFS from master node
start-dfs.sh
"starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-namenode-ip-172-31-36-40.out
slave2: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-datanode-ip-172-31-36-38.out
slave1: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-datanode-ip-172-31-36-39.out
master: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-datanode-ip-172-31-36-40.out
master: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-secondarynamenode-ip-172-31-36-40.out"
18. Start Map Reduce from master node
start-mapred.sh
"starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-jobtracker-ip-172-31-36-40.out
slave2: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-tasktracker-ip-172-31-36-38.out
slave1: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-tasktracker-ip-172-31-36-39.out
master: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-tasktracker-ip-172-31-36-40.out"
Note: apply the same configuration on the master and all slave instances. Steps 14 to 18 are not required on the slaves: the masters and slaves files are read only by the control scripts running on the Name Node or Job Tracker. After starting, running jps on each instance shows which daemons are up and is a quick way to verify the cluster.
Note 2: the Name Node has high memory requirements, as it holds the file and block metadata for the entire HDFS.
The Secondary Name Node periodically merges the Name Node's namespace image with its edit log and keeps a copy of the result. It is not a hot standby, but its checkpoint can be used to recover from Name Node corruption.
The Job Tracker uses considerable memory and CPU.
A good practice is therefore to put the Name Node, Secondary Name Node and Job Tracker on different machines.
The start-dfs.sh script needs to run from the Name Node instance, and the start-mapred.sh script from the Job Tracker instance; the matching stop-mapred.sh and stop-dfs.sh scripts shut the cluster down again.