How To Install Hortonworks HDP 1.1

            
Contents:

Verified Install: 1-15-2013
New: the setup4hadoop.bash script automates all of the Hadoop preparation steps below! 
  • Download setup4hadoop.bash to the head node of your cluster and execute it with the public addresses of the remaining nodes (up to 3) as command-line arguments (a short sketch follows below).
  • When it finishes (with no errors), start on Part 3 above.
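  • A rough sketch of running it (the script itself is attached to this page; the addresses below are placeholders for your nodes' public DNS names):
     # chmod +x setup4hadoop.bash
     # ./setup4hadoop.bash <n1-public-address> <n2-public-address> <n3-public-address>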




OS
  • Details: CentOS 5.8
  • Reasoning: CentOS is widely available and supported across many enterprises.

Hadoop
  • Details: Hortonworks HDP 1.1.1.16-1.el5 (this guide also references Hortonworks HDP 1.0.1.14-1.el5)
  • Reasoning: Hortonworks is an open-source distribution of Hadoop with open-source management tools that stays aligned with, and covers, the Apache Hadoop software stack. Chosen over pure Apache Hadoop because it better defines requirements (OS, tested/supported releases) and can be deployed quickly (automated installation).

H/W
  • Details: Amazon EC2
    • 4 x Small Instances - AMI-4c62c025 (RightScale-CentOS 5.8) ~ 40 minutes, or
    • 4 x Large Instances (M1) - AMI-4c62c025 (RightScale-CentOS 5.8) ~ 20 minutes
    • Note: Small instances may be too 'small' for running jobs, but they serve well to demonstrate an installation.
    • Security Groups - 0.0.0.0/0 (all IPs & ports enabled)
    • Key Pairs - xyz123.pem
  • Reasoning: The purpose is to quickly deploy a Hadoop cluster for Test & Development: low-cost, widely available, easily deployed, repeatable infrastructure. Use medium-to-large instances if you are going to do anything beyond the Hadoop installation.

Support Files
  • A full list of installed files and versions can be found here.
  • Installation Files: attached to this installation documentation you will find all of the installation, configuration (/etc) and supporting log (/var/log) files downloaded from the nodes when the above installation took place (9-2012). This allows you to troubleshoot or analyze the installation using the original files.

Client
  • Details: Windows 7; Putty, PSCP, PuttyGen; Chrome web browser
  • Reasoning: Greater distribution to the audience using enterprise-standard Windows client machines to execute the installation.







HDP Forum
  • HDP Installation Frequently Asked Questions (FAQ)
  • HDP Installation Common Issues
  • HDP Installation - Support

Puppet
  • Puppet Kick Fail
  • Puppet Certificate Issues

RHEL/CentOS 6.X
  Always check the forums when installing on the latest supported OS. HDP supports both RHEL/CentOS 5.X and 6.X, so if you can go with 5.X, do it. I've seen many installs on CentOS 5.7 and 5.8.
  9-3-2012 - A couple of potential issues were found, discussed in the HDP Installation forum here 
 If all else fails... keep doing research; it can be done. Here are some potential configuration issues:
  • No network configuration, IP address, or Ethernet device defined
  • No DNS server or reverse DNS capability
  • No software repositories defined, or repositories incorrectly defined
  • Incorrect run control level / services not installed, or not running, correctly
  • Permissions issues, e.g. installing with users other than root
  • Software conflicts: old or conflicting software versions need to be removed
  • Before starting the installation, reboot the nodes! This can bring out errors that are not easily seen.







How To Install Hortonworks HDP  
~34 Minutes
Setting Up EC2 (0:00-5:56)
Setting Up Linux (5:57-19:21)
Installing Hadoop (19:22-33:49) 




Step-By-Step Guides







Configuration Details
 EC2 Details
  • Number of instances: 4
  • AMI Instance: ami-4c62c025
  • Instance size: Small
  • Kernel ID + RAM Disk ID: use default
  • Instance Details: use default
  • Keypair: download .pem file to client machine
  • Security Group: create new group with 'ALL TCP' enabled (0.0.0.0/0)
 Required Programs
  •  Putty - used to connect to Amazon EC2 instances from the client machine
    • default user = root
    • private key = .ppk file (generated by PuttyGen from the downloaded .pem file)
  • PuttyGen - used with the downloaded .pem file to create the .ppk file.
  • pscp - PuTTY's secure copy utility, used from the Windows command line.
    • syntax: ~/> pscp -i abc123.ppk filename root@public-addr:/path/
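    • example: using the key pair name and head-node public address from later in this guide, the private key upload would look roughly like:
      ~/> pscp -i hdp-privkey1.ppk hdp-privkey1.pem root@ec2-23-22-223-4.compute-1.amazonaws.com:/root/.ssh/id_rsa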


Sign-up for Amazon EC2

o   http://aws.amazon.com


1.1 Deploy Instances

·         Go to AWS Management Console > EC2 Services > Click “Launch Instance”

·         Choose the Classic Wizard: Select the Community AMIs tab > Select an AMI with CentOS 5.x

§  Recommended AMI: 

AMI-4c62c025 (RightScale-CentOS 5.8)

o   Request 4 Small instances as regular 'Launch Instances' (instead of Spot instances)

o   Kernel ID + RAM Disk ID = Use Default

o   Instance Details = Use Default

Create New Key Pair

§  1 key pair used for all instances named: hdp-privkey1.pem

§  Download hdp-privkey1.pem (DO NOT LOSE THIS)

o   DON’T FORGET: download .pem file to your local machine when creating a new key pair (hdp-privkey1.pem).

·         The .pem (private key) file allows your client machine to connect to the running Amazon EC2 instance through SSH.

·         If you lose the .pem file you will need to re-create the instance; Amazon doesn’t store this file, for security reasons.

o   However you can stop, snapshot, and re-create a new instance based on this one so you don’t lose your configuration (data should be on Amazon S3)


Create New Security Group

§  1 security group for all 4 instances:

·         Name group = hdp-sg1

·         Description = security port settings for 4 node cluster HDP installation

·         Add the rule for ALL TCP (0.0.0.0/0) and then save


o   LAUNCH

§  Check status

·         Click on ‘Running Instances’

·         Make sure the state for each server = ‘running’

·         Make sure the Status check has a green checkmark

 

1.2 Connect To EC2 instances from Windows

Multiple Methods:

o   NEW: Amazon SSH Client (through browser)

o   Go to AWS Management Console > EC2 Dashboard

§  > Running Instances

§  > Select instance, click ‘Instance Actions’

§  > Click Connect

 

o   Preferred: Putty

o   Required: download PuttyGen

o   Required: download Putty (SSH client)

§  DOWNLOAD HERE: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

o   Create .ppk file for Putty SSH client

§  Open PuttyGen, and Click Conversions > Import Key

§  Navigate and select hdp-privkey1.pem that was created in previous steps.

§  Click 'Save private key', skip the passphrase, and save as: hdp-privkey1.ppk

o   Get the IP for the EC2 instance

§  Go to Amazon AWS, EC2 Services, Click Running Instances

§  Select the instance you want to connect to; the ‘Instance Actions’ button at the top becomes available. Click Connect and copy the public address for the instance.

o   Connect to EC2 master instance with Putty

§  Open Putty; enter the IP address in the Host Name field.

§  In the category tree to the left, select Connection > Data

·         In the Auto-login username field put ‘root’

§  In the category tree to the left, select Connection > SSH > Auth

§  Under Authentication Parameters

·         Private Key File For Authentication

o   Hit browse, and select hdp-privkey1.ppk created in the last step

§  To Save the configuration settings

·         Go back to Session in the category tree to the left

·         Type a name for this configuration under Saved Sessions, I called mine “master-hadoop-ec2”

§  Click Open to connect to the EC2 instance.











Configuration Details (see below this section for full history of commands with descriptions)
 /etc/hosts Must be configured.

The /etc/hosts file on each node of your cluster must be configured with an entry for localhost
 and an entry for each of the nodes in your cluster. The /etc/hosts format is one entry per line,
 with each data value separated by a tab; there are 3 values: 
   [IP address \t  FQDN=HostName.DomainName \t  Alias] 

To determine your IP address and FQDN on each node execute the following:

 # hostname -i
   [ returns the IP address of the node you are currently on ]
 # hostname -f
   [ returns the FQDN (or hostname.domainname) of the node you are on ]

After setting up your hosts file, it should look something like the following (although the IP addresses and domain names will be different):

sample /etc/hosts:
127.0.0.1            localhost.localdomain        localhost
10.214.138.38     domU-12-31-39-0B-85-D8.compute-1.internal        h1
10.190.42.215    ip-10-190-42-215.ec2.internal    n1
10.72.127.132    ip-10-72-127-132.ec2.internal    n2
10.211.35.217    domU-12-31-39-0A-20-2B.compute-1.internal    n3
 
 ssh / sshd Password-less SSH must be enabled on all nodes.

1) The first step is to make sure that the private key file from the public/private key pair you created (and downloaded) when
deploying the nodes through the EC2 'Launch Instance' wizard is on every node in your cluster. If you downloaded this file
(xyz123.pem, or whatever you named the file, ending in .pem) to your local Windows client machine, you can use the
PSCP program to upload the file to each of the nodes in your cluster. Upload the .pem file to the /root/.ssh 
directory on each node in your cluster, and rename the file from xyz123.pem to id_rsa. Once you have done this you will need
to set permissions on each of the nodes by executing the following command:

 # chmod 700 /root/.ssh ; chmod 640 /root/.ssh/authorized_keys ; chmod 600 /root/.ssh/id_rsa
This will set permissions such that: 
   root (and no one else) has Read/Write/Execute on the directory /root/.ssh
   the file /root/.ssh/authorized_keys can only be Read/Written by root, and Read by the group it belongs to
   the file /root/.ssh/id_rsa can only be Read/Written by root
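
To double-check the result, something like the following (not part of the original command history) can be run on each node:

 # ls -ld /root/.ssh ; ls -l /root/.ssh/authorized_keys /root/.ssh/id_rsa
   [ should show drwx------ for /root/.ssh, -rw-r----- for authorized_keys, and -rw------- for id_rsa ]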

2) Next we want to make sure that every time we log into one of the nodes in our cluster we are not asked to confirm the host key. To do
this we will change the ssh configuration file and then restart the ssh daemon. Execute the following command:
 # sed -i 's/^.*StrictHostKeyChecking.*$/StrictHostKeyChecking=no/' /etc/ssh/ssh_config ; service sshd restart

Once password-less SSH has been setup correctly, you should be able to jump around from host to host fairly quickly, to test
this out, try the following from the head node (h1):

 # ssh n1
 # ssh n2
 # ssh n3

Don't forget to exit out of each session!

 yum / rpm  Must be enabled and configured properly.

Make sure to use the standard CentOS-Base.repo file for your repository; it should be placed in your /etc/yum.repos.d/
directory.

~~ Repositories ~~

One of the most important parts of a correct setup is making sure you have the software repositories set up correctly on your
machine. To start, let's back up all the repos the machine is currently using and then install our own. To back up the repos that
are currently on the node, execute the following:

 # cd /etc/yum.repos.d/ ; for b6 in `ls` ; do mv $b6 ${b6}.bak ; done
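   [ if you ever need the original repo files back, a one-liner along these lines (not part of the original walkthrough) reverses the rename: ]
 # cd /etc/yum.repos.d/ ; for b in *.bak ; do mv "$b" "${b%.bak}" ; done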

Once you have renamed (and thereby saved) all of the repos that yum is currently using, install the CentOS 5 base repo
file that can be found here: while still in the /etc/yum.repos.d/ directory, execute the following command and paste in the
text:

 # vi CentOS-Base.repo
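
As a rough guide (the authoritative contents are at the link above), the [base] stanza of that file typically looks something like the following; the full file also contains [updates], [extras], and other sections:

 [base]
 name=CentOS-$releasever - Base
 mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
 #baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
 gpgcheck=1
 gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-5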


Also make sure you install the Hortonworks HDP repository:
HDP 1.1.1.16 Repository (rpm file that installs hdp.repo in /etc/yum.repos.d/ ; check Hortonworks.com for the latest version)

 # rpm -Uvh http://public-repo-1.hortonworks.com/HDP-1.1.1.16/repos/centos5/hdp-release-1.1.1.16-1.el5.noarch.rpm


~~ Existing Software ~~

Check the HDP Installation Documentation for a current list of software; you need to check for and remove any existing
software that may conflict with the installation files. At the time of this writing (9-2012) I used the following command to check:

 # rpm -qa | grep -ie ruby -ie puppet -ie passenger -ie nagios -ie mysql -ie ganglia -ie rrdtool 

For example, on my image ruby and rrdtool were installed, so I executed the following command before installing HDP:

 # yum erase -y ruby* rrdtool* 
 [ the -y option will answer yes and push through the uninstall. When HDP gets installed all dependencies will come with it. ]

~~ Necessary Software ~~

 # rpm -qa | grep -ie yum -ie rpm -ie scp -ie curl -ie wget -ie pdsh
   [ scp won't come back in the list, but executing the command from the terminal should show it's there ]
   [ pdsh is not installed either; it can be installed from the HDP repository or compiled from source ]
 
If you are having issues with your yum program and repository you can refresh/clean the database using the following:
 # rm -rf /var/lib/rpm/__db*
 # rpm --rebuilddb
 # yum clean all
 # yum update
 # yum-complete-transaction

 DNS / Reverse DNS Must be enabled and configured properly.

Both DNS lookups and reverse DNS lookups need to work on the machine, and making sure the hosts file
is set up correctly is a great step toward that end. If you are having issues with DNS, it is also a good idea to check /etc/resolv.conf for the
DNS server that is defined.

To check that DNS / Reverse DNS lookups are working properly execute any/all the following:

 # host `hostname -f`
   [ Will execute the host command against the FQDN which should return the IP address if working correctly ]
 # host `hostname -i`
   [ Will execute the host command against the IP address which should return the FQDN if working correctly ]
 # dig amazon.com
   [ Should resolve and provide the DNS server used along with corresponding IPs ]



Command History & Descriptions


Requirements:
  • Followed steps in last guide "How to Setup EC2"
  • Windows client with private key file, .ppk generated by PuttyGen, Putty SSH Terminal, and PSCP file transfer utility, which can be DOWNLOADED HERE
  • 4 running instances on Amazon EC2, minimum size small, of AMI= ami-4c62c025
  • Putty ssh terminal configured to connect to the head node
 

  1. Open Chrome
  2. Navigate to aws.amazon.com
  3. Login to AWS Management Console
  4. Click on EC2 (dashboard)
  5. Click 'Running Instances'
    >   you should have 4 running instances dedicated to this installation, which were created in the last guide "How to Setup EC2"

  1. Click the check mark next to the first of the 4 instances (we will designate this as the head node) 
  2. Click 'Instance Actions'
  3. Click 'Connect'
    > A 'Connect to an instance' box should pop up. Copy the Public address; ours was ec2-23-22-223-4.compute-1.amazonaws.com. Save it to a notepad file and label it h1
    > Repeat the above steps for the remaining 3 instances > Save each instance's public address to the notes file from above, labeling them n1, n2, n3 (node1-3) in that order

  1. Run the putty client program
  2. Connect to the head node h1
    > Setting up putty to connect to an EC2 instance is covered in the last guide "How to Setup EC2"
 





  1. Open Windows Start
  2. Run: cmd (command line utility)
  3. Navigate to where the hdp-privkey1.pem and hdp-privkey1.ppk files are located, for us it was c:\ (root)
  4. Make sure the pscp.exe (putty scp) file is downloaded to this directory or its location is on the Windows PATH so it can be executed from any directory

EXECUTE Windows Client:
  c:\ pscp -i hdp-privkey1.ppk hdp-privkey1.pem root@ec2-23-22-223-4.compute-1.amazonaws.com:/root/.ssh/id_rsa
   > This will use the putty scp tool to upload your .pem (private key file) to the head node and locate it at /root/.ssh renaming the file to id_rsa






   > Set permissions on the head node for using the private key file


EXECUTE H1:
 # ls /root/.ssh
 # chmod 700 /root/.ssh ; chmod 640 /root/.ssh/authorized_keys ; chmod 600 /root/.ssh/id_rsa
   > Set the /root/.ssh directory to owner=read/write/execute only; set the public key file 'authorized_keys' to owner=read/write, group=read; and set the private key file 'id_rsa' to   
       owner=read/write only



   > Gather the network information for the head node
 
EXECUTE H1:
 # echo -e "`hostname -i`\t`hostname -f`\th1"
   > print to stdout the IP address, private DNS name (FQDN), and h1 (host alias), using echo with the special character '\t' to delimit the values with a tab character.
   > save this line for the head node h1 to the previously created notes file as it will be used later to populate the /etc/hosts file of each instance
   > You should be able to ssh from the head node h1 into the other instances (n1,n2,n3) now that we have uploaded the .pem as id_rsa and set permissions
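   > Using the sample hosts file values shown earlier, the line produced for h1 would look something like:
      10.214.138.38	domU-12-31-39-0B-85-D8.compute-1.internal	h1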
 




EXECUTE H1:
 # ssh ec2-204-236-204-192.compute-1.amazonaws.com
 # [TYPE NO TO AUTHENTICATION REQUEST]
   > Get the public address for node1/n1 that we copied in the previous steps and connect to this instance through the head node using SSH
   > You will notice the ssh tool asking you to authenticate the host; since this is T&D let's remove this message so we can ssh automatically into all of the nodes
 
EXECUTE H1:
 # sed -i 's/^.*StrictHostKeyChecking.*$/StrictHostKeyChecking=no/' /etc/ssh/ssh_config ; service sshd restart
   > This will use the sed command to search and replace from the /etc/ssh/ssh_config file the first occurrence of the variable StrictHostKeyChecking and set it equal to 'no'
   > The next command will restart the ssh daemon such that it can pick up this configuration file change
 # ssh ec2-204-236-204-192.compute-1.amazonaws.com
    > you should now be logged into node1/n1 through its public DNS address; let's grab its network information including IP address, private DNS address, and name it n1
 # echo -e "`hostname -i`\t`hostname -f`\tn1"
    > save this line for node1/n1 to the previously created notes file
 

 
 # exit
   > exit your ssh connection to n1 and return to the head node h1

EXECUTE N1+N2+N3:
   > REPEAT: the steps above for ssh into n2 and n3 using their public DNS address. Grab their network information for IP address, Private DNS address, and don't forget to change the alias to 
      n2, n3




EXECUTE H1:
 # vi /etc/hosts
 # KEYBOARD INPUT=lowercase: o
   > Setup your hosts file for the head node h1
   > VI lowercase o: Go into insert mode and add new line below current line
   > Paste the network configuration for h1,n1,n2,n3 in the format of one host per line, with IP\tPrivateDNS\talias format
 # KEYBOARD INPUT=escape
   > VI: Exit insert mode
 # KEYBOARD INPUT=capital: ZZ
   > VI: Save and exit

   > You should now be able to ssh into the alias of each machine without having to re-copy the public/private dns address each time, try it:
 # ssh n1
 # exit
 
EXECUTE N1+N2+N3:
   > REPEAT: the steps above to ssh into each node n1,n2,n3 and vi each instance's /etc/hosts file to add the 4 lines for each host in the cluster



   > Now we will transfer the id_rsa file from the head node to each instance

EXECUTE H1:
  # scp /root/.ssh/id_rsa n1:/root/.ssh ; scp /root/.ssh/id_rsa n2:/root/.ssh ; scp /root/.ssh/id_rsa n3:/root/.ssh
   > Setup nodes n1,n2,n3 to be ready for password-less ssh. They will need the private key file to do this. Once we install pdsh we'll start executing commands against all 3 nodes.
  
   > Let's make sure we have the standard CentOS mirrorlist of repositories for yum to pull from. We are going to want to install pdsh, so let's grab the HDP repo as well.

EXECUTE H1:
 # cd /etc/yum.repos.d/ ; for b in `ls` ; do mv $b ${b}.bak ; done
   > Navigate to the directory where yum repositories are defined > Loop over every file in this directory and rename it to have a .bak extension
   > Renaming a .repo file to .repo.bak will effectively make it so that yum does not recognize this file as a repository to include in installations
 # ls
   > verify the command executed properly
   > Install the CentOS mirror list of repositories
 # vi /etc/yum.repos.d/CentOS-Base.repo
   > Creates the CentOS repository file, press 'i' to go into insert mode, insert the contents FOUND HERE, press ESC, then (capital) ZZ to save and exit
 # rpm -Uvh http://public-repo-1.hortonworks.com/HDP-1.1.1.16/repos/centos5/hdp-release-1.1.1.16-1.el5.noarch.rpm

   > Install the Hortonworks Data Platform (HDP) repository file  






EXECUTE H1:
 # yum install -y pdsh
   > Install pdsh (from HDP repo) and automatically answer yes to installation questions
 # vi /etc/pdsh/machines
   > Add the private DNS address for n1,n2,n3 here, one per line, with no leading or trailing blank lines or spaces. Do not include the head node h1 since we are executing from that node (see the example after this block)
 # pdsh -a whoami
   > execute the command 'whoami' across all nodes defined in /etc/pdsh/machines file (in our case n1,n2,n3), should return with root for each machine, and you now have access to all
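   > Using the sample private DNS names from the /etc/hosts example earlier in this document, /etc/pdsh/machines would look something like this (your addresses will differ; the head node h1 is intentionally not listed):
      ip-10-190-42-215.ec2.internal
      ip-10-72-127-132.ec2.internal
      domU-12-31-39-0A-20-2B.compute-1.internal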






EXECUTE H1:
 # CTRL + R chmod
   > On the keyboard execute CTRL + R which will run the bash reverse history search. Type in 'chmod' to find the set of permission commands used to set ssh permissions on the head node
 # pdsh -a "chmod 700 /root/.ssh ; chmod 640 /root/.ssh/authorized_keys ; chmod 600 /root/.ssh/id_rsa"
   > After finding the command wrap a pdsh -a and double quotes " around the set of chmod commands. This will execute for all machines in /etc/pdsh/machines which is n1, n2, n3
 # CTRL + R sed
   > Find the set of commands we executed on h1 for editing the ssh configuration file and restarting services. Wrap pdsh -a and double quotes " around it.
 # pdsh -a "sed -i 's/^.*StrictHostKeyChecking.*$/StrictHostKeyChecking=no/' /etc/ssh/ssh_config ; service sshd restart"
 # ssh n1
 # ssh n2
 # ssh n3
   > Connect from head node h1 to node1, from node1 to node2, and from node2 to node3. This is a quick check to make sure host aliases and password-less ssh are working on all nodes.
   > May want to even go from n3 back to another node in the cluster to test n3 just in case.
 # exit
 # exit
 # exit
   > Return to head node terminal input




EXECUTE H1:
 # pdsh -a "rm /etc/yum.repos.d/*"
   > Remove all existing repository definitions. No worries here, as we backed up the originals on the head node, and these are T&D VM images after all...
 # pdsh -a "ls /etc/yum.repos.d/"
   > Verify your handiwork; the command should not return any output from any of the nodes.
 # scp /etc/yum.repos.d/CentOS-Base.repo n1:/etc/yum.repos.d/ ; scp /etc/yum.repos.d/CentOS-Base.repo n2:/etc/yum.repos.d/ ; scp /etc/yum.repos.d/CentOS-Base.repo n3:/etc/yum.repos.d/
   > Transfer over definition of CentOS repositories from head node
 # CTRL + R rpm
 # pdsh -a "rpm -Uvh http://public-repo-1.hortonworks.com/HDP-1.1.1.16/repos/centos5/hdp-release-1.1.1.16-1.el5.noarch.rpm" | dshbak
   > Execute rpm via pdsh to install the HDP repository definition file on all nodes
   > Pipe the output of the pdsh command through dshbak, which groups and formats the output per node
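   > For reference, dshbak prints the collected output grouped under a per-host header, roughly in this shape (hostname illustrative):
      ----------------
      ip-10-190-42-215.ec2.internal
      ----------------
      <rpm output from that host>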




EXECUTE H1:
 # pdsh -a 'ls /etc/yum.repos.d/' | dshbak
   > List the yum repository definition directory to verify the CentOS and HDP repository definition file on each node.



You have now setup the nodes and are ready to start installing the software for Hadoop!









Configuration Details (see below this section for full history of commands with descriptions) 
 Installation Details
  • Private Key File
    • Use the .pem downloaded
  • Hostdetail.txt
    • Include the private DNS address of each node in your EC2 cluster.
    • Make sure to have only 1 host per line, with no leading/trailing blank lines. 
  • Installation Path
    • Create a path under the /mnt directory and grant all permissions
    • Make sure to use this path and 'uncheck' the default path selected by the HDP Installation wizard.
 # mkdir /mnt/hdp1 /mnt/hdp1/1 ; chmod 777 /mnt/hdp1 /mnt/hdp1/1
   
  • HBase
    • Change the java heap size for the Master and Regionserver
      • Master: 512MB
      • Regionserver: 1024MB
 ntp / ntpd  Must be enabled.

The AMI we are using for CentOS 5.8 (and most distros for that matter) will not have the ntp service installed or configured.
To check if ntp is installed on your system check /etc/init.d/ directory for the ntpd (ntp daemon) or through issuing:

 # service ntpd status
   [ check on the status of the service. Will give an error message if no such service exists ]
 # rpm -qa | grep ntp
   [ will check if ntp was ever installed on this machine through the rpm database ]

If you find that ntpd service is not running and ntp is not installed on your machine use yum to install:

 # yum install -y ntp

Once ntp has successfully been installed on your node (and needs to be on all nodes) execute the following commands to
turn the ntp service on and keep the service on:

 # service ntpd start
 # chkconfig ntpd on
 # service ntpd status
 # chkconfig --list ntpd

 SELinux  Must be disabled.

Commands to check status:

 # vi /etc/selinux/config
   [ look for the SELINUX=... line; it should be set to disabled, not enforcing ]
 # sestatus
   [ This command will tell you what the current status is, but may not always work ]

Commands to turn off:
   
 # setenforce 0

 # echo 0 > /selinux/enforce

 # vi /etc/selinux/config
   [ change the SELINUX variable to disabled ]

 # vi /boot/grub/grub.conf
   [ set or add line for selinux=0 ] 
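
To apply this across all nodes at once, a sketch (assuming pdsh has already been set up as described earlier; this is not part of the original command history):

 # pdsh -a "setenforce 0 ; sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config"
   [ sets SELinux to permissive immediately and to disabled on the next boot, on every node listed in /etc/pdsh/machines ]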
 
 iptables  Must be disabled.

Commands to check status:

 # service iptables status
   
 # /etc/init.d/iptables status
   
 # chkconfig --list iptables
   [ check what run control levels iptables is turned on/off for ]
   
Commands to turn off:
   
 # service iptables stop
   [ turn off the iptables firewall service ]
 # /etc/init.d/iptables stop

 # chkconfig iptables off
   [ do not allow iptables to turn on with boot/reboot  ]
 # chkconfig --levels 2345 iptables off
   [ manually set which run control levels iptables is turned off for ]

Note: depending on your installation/distribution you may have the ip6tables service installed and/or possibly running on your
machine. As a precaution it does not hurt to also run the same commands to stop and disable ip6tables, which 
would be the following:

 # service ip6tables stop ; chkconfig ip6tables off


Command History & Descriptions

REQUIREMENTS:
  • Followed steps in last 2 guides "Setting Up EC2" & "Setting Up Linux"
  • 4 Node instances on Amazon EC2 with correct SSH settings, hosts, and Repository Definitions



 > Here we will check the nodes in the cluster for any pre-installed software that may conflict with our Hadoop installation.

EXECUTE H1: 
 # rpm -qa | grep -ie ruby -ie passenger -ie nagios -ie ganglia -ie puppet -ie rrdtool -ie mysql
  > Check for existing software installations
 # yum erase -y ruby* rrdtool*
  > Remove existing installations of ruby and rrdtool from head node
 # pdsh -a "yum erase -y ruby* rrdtool*"
  > Remove existing installations of ruby and rrdtool from node 1, node 2, node 3




EXECUTE H1: 
 # service ntpd status
  > Check the ntp daemon status. It should come back with an 'unrecognized service' error, signifying ntp is not installed.
 # yum install -y ntp
  > Install the ntp service on the head node
 # pdsh -a "yum install -y ntp"
  > Install the ntp service on N1, N2, N3
 # chkconfig ntpd on ; chkconfig iptables off ; chkconfig ip6tables off
  > Configure which services run at startup: ntpd starts on boot, while the iptables and ip6tables firewalls stay off at boot.
 # pdsh -a "chkconfig ntpd on ; chkconfig iptables off ; chkconfig ip6tables off"
  > Repeat the command above across N1, N2, N3
 # pdsh -a reboot
 # reboot
  > Reboot all the nodes in the cluster. This makes sure all settings take effect, and is often a best practice before starting, to flush out any issues that may be present.

(wait a couple of minutes for the servers to come back online)

EXECUTE H1: 
 # service ntpd status ; service iptables status ; service ip6tables status
  > Check the status of each one of the services, this should be ntpd=running, iptables+ip6tables=not running
  # pdsh -a "service ntpd status ; service iptables status ; service ip6tables status" | dshbak
  > Check above command across all other nodes in the cluster.





  > We will install the HMC software on the head node to prepare the node cluster for the Hadoop installation.

EXECUTE H1: 
 # yum install -y epel-release
 # pdsh -a "yum install -y epel-release"
  > Install the EPEL repository on the nodes. This gives access to dependent software necessary for the HMC installation.
 # yum install -y php-pecl-json
 # pdsh -a "yum install -y php-pecl-json"
  > Install php-pecl-json on the nodes. This provides the PHP JSON extension (from PECL) that the HMC installation depends on.
 # yum install -y hmc
  > Install HMC on the head node

EXECUTE H1: 
 # df -h
  > Get information for the mounted disk drives. We see that /mnt has 140GB of space available and is the preferred directory to install Hadoop
 # mkdir /mnt/hdp1 /mnt/hdp1/1 ; chmod 777 /mnt/hdp1 /mnt/hdp1/1
 # pdsh -a "mkdir /mnt/hdp1 /mnt/hdp1/1 ; chmod 777 /mnt/hdp1 /mnt/hdp1/1"
  > Create the directory /mnt/hdp1/1 on all nodes and grant read/write/execute to all users to this directory so we hit no permissions issues during install.
 # service hmc start
 # KEYBOARD INPUT=y
 # KEYBOARD INPUT=y
  > This will start the HMC service and download any Java dependencies.




  > Grab the Public DNS address to your Head node H1 and navigate your web browser to the following address
  > https://your-head-node-public-dns/hmc/html/index.php

  Click "Get Started"

Create Cluster
  1.  Name your cluster, I named mine 'seanc'

Add Nodes
  1. Upload the private key file you downloaded when you first created your instances, this is the .pem, 'hdp-privkey1.pem' for us. Upload for the 'SSH Private Key File for root'
  2. Create a file called "Hostdetail.txt" which lists the private DNS address for each node in the cluster (including the head node), one per line, with no leading/trailing spaces or blank lines (THIS IS IMPORTANT!) - see the example after this list
  3. Upload the Hostdetail.txt file as your 'Hosts File'
  4. Leave 'Use local yum mirror instead of download packages from the internet' unchecked.
  5. Click 'Add Nodes'
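
Using the sample private DNS names from the /etc/hosts example earlier in this document, Hostdetail.txt would look something like this (your addresses will differ):

domU-12-31-39-0B-85-D8.compute-1.internal
ip-10-190-42-215.ec2.internal
ip-10-72-127-132.ec2.internal
domU-12-31-39-0A-20-2B.compute-1.internal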


 


 

Select Services
  1. Leave all services checked and click 'Select Services'

Assign Hosts
  1.  Select the dropdown for each service such that the following takes place:
    1.  Head Node - HMC Server, Templeton Server, Nagios Server, Ganglia Collector
    2.  Node1 - NameNode, ZooKeeper, HBase Master, Oozie Server, Hive MetaStore, JobTracker
    3.  Node2 - ZooKeeper
    4.  Node3 - Secondary NameNode, ZooKeeper
  2. Click Next
             (These assignments are definitely up for debate, and I welcome anyone's recommendations on what to assign where.)


Select Mount Points
  1. Uncheck the default mount directory (in our case '/mnt')
  2. Use the Custom mount point we created '/mnt/hdp1/1'
  3. Click Next

Custom Config / Customize Settings
  1. Nagios Tab
    1. Fill-in Nagios Admin Password & Email
  2. Hive/HCatalog Tab
    1. Fill-in MySQL Password
  3. HBase Tab
    1. Set 'HBase Region Servers maximum Java heap size' to 1024MB or greater
    2. Set 'HBase Master Maximum Java heap size' to 512MB or greater
  4. Click 'Finished customizing all components'
  5. Review Settings
  6. Click 'Deploy'
 



 (Process takes about ~40 minutes on 4 small Amazon EC2 instances)



EXECUTE H1:
  > You may need to restart your session 

 # service hmc restart
  > Restart the hmc service to have the installation take effect



You have now created and installed a 4 Node Hadoop cluster. Congratulations !
