How To Install Hortonworks HDP 1.2.0

Updated June 19th, 2013
Verified Install:
New: the setup4hadoop.bash script automates all of the preparation steps for installing Hadoop on EC2!
  • Download setup4hadoop.bash to the head node of your cluster and execute it with the public addresses of the remaining nodes (up to 3) as command-line arguments.
  • When it finishes with no errors, start at Part 3.2, where it says 'ambari-server setup'

 Type Details   Reasoning
OS   CentOS 6.2 CentOS is widely available and supported across many enterprises. 

Hortonworks - HDP 1.2.0
Hortonworks is an open-source distribution of Hadoop, with open-source management tools, that stays aligned with and covers the Apache Hadoop software stack.

Chosen over pure Apache Hadoop for its better-defined requirements (OS, tested/supported releases) and its ability to deploy quickly (automated installation).

Amazon EC2

4 x Medium - CentOS-x64-6.0-core (ami-03559b6a) - EBS Instance
 or as a backup
ami-043f9c6d (RightScale-CentOS 6.2) - note: no EBS 

Purpose is to quickly deploy a Hadoop cluster for Test & Development.

Low-cost, widely available, easily deployed, repeatable infrastructure.

Use medium-to-large size instances if you are going to do anything beyond the Hadoop installation.
 Client Windows 7
Putty , PSCP , PuttyGen
Chrome Web Browser
Reaches a wider audience by using enterprise-standard Windows client machines to execute the installation.

PostgreSQL (version 9+) may cause problems with Ambari
More information here
(special thanks to Phillip Burger for the catch)
 HDP Forum HDP Installation Frequently Asked Questions (FAQ)
HDP Installation Common Issues
HDP Installation - Support
Puppet Kick Fail
Puppet Certificate Issues
 RHEL/CentOS 6.X: Always check the forums when installing on the latest supported OS. HDP supports both RHEL/CentOS 5.X and 6.X, so if you can go with 5.X, do it. I've seen many installs on CentOS 5.7 and 5.8.
9-3-2012 - A couple of potential issues found, discussed in the HDP Installation forum here
 If all else fails...  Keep doing research; it can be done. Here are some potential configuration issues:
  • No network configuration, IP address, or Ethernet device defined
  • No DNS server or reverse-DNS capability
  • No software repositories defined, or repositories defined incorrectly
  • Incorrect run control level; services not installed or not running correctly
  • Permission problems, e.g. installing as a user other than root
  • Software conflicts; old or conflicting software versions need to be removed
  • Before starting the installation, reboot the nodes! This can surface errors that are not easily seen.
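The checklist above can be turned into a quick pre-flight script to run on each node before installing. This is only a sketch under the assumption of a CentOS-style layout; the `preflight` function name and the exact messages are mine, not part of any HDP tooling, and the script only reports, it changes nothing:

```shell
#!/bin/bash
# Pre-flight sanity checks before a Hadoop install (sketch; CentOS-style layout assumed).
preflight() {
  echo "Hostname (FQDN): $(hostname -f 2>/dev/null || hostname)"
  echo "IP address: $(hostname -i 2>/dev/null || echo 'unresolved - check /etc/hosts')"

  # Does the hostname resolve (via DNS or /etc/hosts)?
  if getent hosts "$(hostname)" >/dev/null 2>&1; then
    echo "Resolver: OK"
  else
    echo "Resolver: WARNING - hostname does not resolve"
  fi

  # Are yum repository definitions present?
  if [ -d /etc/yum.repos.d ]; then
    echo "Repo files: $(ls /etc/yum.repos.d | wc -l)"
  else
    echo "Repo files: WARNING - /etc/yum.repos.d missing"
  fi

  # Installing as root?
  if [ "$(id -u)" -eq 0 ]; then
    echo "User: root"
  else
    echo "User: WARNING - not root"
  fi
}
preflight
```

Any WARNING line points at one of the configuration issues listed above and is worth resolving before you start.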

How To Install Hortonworks HDP  
~34 Minutes
Setting Up EC2 (0:00-5:56)
Setting Up Linux (5:57-19:21)
Installing Hadoop (19:22-33:49) 
(note: the video tutorial provides instructions for HDP 1.1.x on CentOS 5)

Step-By-Step Guides

Hadoop is still evolving quickly, so some of these instructions may change from time to time. Also, since this is not my full-time job, I cannot maintain the site as fast as I'd like, so please follow these instructions carefully. Use at your own risk, and when in doubt, go to the official installation notes at

Configuration Details
 EC2 Details
  • Number of instances: 4
  • AMI Instance: ami-03559b6a
  • Instance size: medium
  • Kernel ID + RAM Disk ID: use default
  • Instance Details: use default
  • Keypair: download .pem file to client machine
  • Security Group: create new group with 'ALL TCP' enabled (
 Required Programs
  •  Putty - used to connect to Amazon EC2 instances with client machine
    • default user = root
    • authorized = .ppk file (generated by PuttyGen from downloaded .pem file)
  • PuttyGen - use with downloaded .pem file to create .ppk file.
  • pscp - putty transfer utility. Used through windows command line.
    • syntax: ~/> pscp -i abc123.ppk filename root@public-addr:/path/

Sign-up for Amazon EC2

  • Http://

1.1 Deploy Instances

  • Go to AWS Management Console > EC2 Services > click 'Launch Instance'
  • Choose the Classic Wizard: select the Community AMIs tab > select an AMI with CentOS 6.x
    • Recommended AMI: ami-03559b6a (RightScale-CentOS 6.0)
  • Select 4 Medium instances of type Launch Instance (instead of Spot)
    • Kernel ID + RAM Disk ID = Use Default
    • Instance Details = Use Default
  • Create New Key Pair
    • 1 key pair used for all instances, named: hdp-privkey1.pem
    • Download hdp-privkey1.pem (DO NOT LOSE THIS)
    • DON'T FORGET: download the .pem file to your local machine when creating a new key pair (hdp-privkey1.pem).
  • The .pem (private key) file allows your client machine to connect to the running Amazon EC2 instance through SSH.
  • If you lose the .pem you will need to re-create the instance; Amazon doesn't store this file for security reasons.
    • However, you can stop, snapshot, and re-create a new instance based on this one so you don't lose your configuration (data should be on Amazon S3)

Create New Security Group

  • 1 security group for all 4 instances:
    • Name = hdp-sg1
    • Description = security port settings for 4-node cluster HDP installation
    • Add the rule for ALL ( and then save

  • Check status
    • Click on 'Running Instances'
    • Make sure the state for each server = 'running'
    • Make sure the Status check has a green checkmark


1.2 Connect To EC2 Instances from Windows

Multiple Methods:

  • NEW: Amazon SSH Client (through browser)
    • Go to AWS Management Console > EC2 Dashboard
    • > Running Instances
    • > Select instance, click 'Instance Actions'
    • > Click Connect

  • Preferred: Putty
    • Required: download PuttyGen
    • Required: download Putty (SSH client)

  • Create a .ppk file for the Putty SSH client
    • Open PuttyGen and click Conversions > Import Key
    • Navigate to and select hdp-privkey1.pem, created in the previous steps.
    • Click Save (no passphrase), as: hdp-privkey1.ppk
  • Get the IP for the EC2 instance
    • Go to Amazon AWS, EC2 Services, click Running Instances
    • Select the instance you want to connect to; a button at the top called 'Instance Actions' becomes visible. Click Connect and copy the public IP address for the instance.
  • Connect to the EC2 master instance with Putty
    • Open Putty; enter the IP address in the Host Name field.
    • In the category tree to the left, select Connection > Data
      • In the Auto-login username field, put 'root'
    • In the category tree to the left, select Connection > SSH > Auth
    • Under Authentication Parameters
      • Private key file for authentication
        • Hit Browse, and select the hdp-privkey1.ppk created in the last step
    • To save the configuration settings:
      • Go back to Session in the category tree to the left
      • Type a name for this configuration under Saved Sessions; I called mine "master-hadoop-ec2"
    • Click Open to connect to the EC2 instance.

Command History & Descriptions

  • Followed steps in last guide "How to Setup EC2"
  • Windows client with private key file, .ppk generated by PuttyGen, Putty SSH Terminal, and PSCP file transfer utility, which can be DOWNLOADED HERE
  • 4 running instances on Amazon EC2, minimum size medium, of AMI= ami-03559b6a
  • Putty ssh terminal configured to connect to the head node

  1. Open Chrome
  2. Navigate to
  3. Login to AWS Management Console
  4. Click on EC2 (dashboard)
  5. Click 'Running Instances'
    > You should have 4 running instances dedicated to this installation, which were created in the last guide "How to Setup EC2"

  1. Click the check mark next to the first of the 4 instances (we will designate this as the head node) 
  2. Click 'Instance Actions'
  3. Click 'Connect'
    > A 'Connect to an instance' box should pop up. Copy the public address, save it to notepad, and name it h1
    > Repeat the above steps for the remaining 3 instances > save each instance's public address to the note file from above, labeling them n1, n2, n3 (node1-3) in that order

  1. Run the putty client program
  2. Connect to the head node h1
    > Setting up putty to connect to an EC2 instance is covered in the last guide "How to Setup EC2"

  1. Open Windows Start
  2. Run: cmd (command line utility)
  3. Navigate to where hdp-privkey1.ppk and hdp-privkey1.pem are located; for us it was c:\ (root)
  4. Make sure pscp.exe (putty scp) is downloaded to this directory, or that its location is on the Windows PATH, so it can be executed from any directory

EXECUTE Windows Client:
  c:\ pscp -i hdp-privkey1.ppk hdp-privkey1.pem
   > This will use the putty scp tool to upload your .pem (private key file) to the head node and locate it at /root/.ssh renaming the file to id_rsa

   > Set permissions on the head node for using the private key file

 # ls /root/.ssh
 # chmod 700 /root/.ssh ; chmod 640 /root/.ssh/authorized_keys ; chmod 600 /root/.ssh/id_rsa
   > Set the /root/.ssh directory to owner=read+write+execute (700); set the public key file 'authorized_keys' to owner=read+write, group=read (640); and set the private key file 'id_rsa' to owner=read+write (600)
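If the permission bits are unfamiliar, you can try the same chmod scheme on scratch files first; this sketch uses a throwaway temp directory so it never touches the real /root/.ssh:

```shell
# Demonstrate the SSH permission scheme on scratch files (NOT the real /root/.ssh).
tmp=$(mktemp -d)
touch "$tmp/authorized_keys" "$tmp/id_rsa"

chmod 700 "$tmp"                  # directory: owner read+write+execute, nothing for group/other
chmod 640 "$tmp/authorized_keys"  # public keys: owner read+write, group read
chmod 600 "$tmp/id_rsa"           # private key: owner read+write only -- must not be readable by others

# Print the resulting octal modes to confirm
stat -c '%a %n' "$tmp/authorized_keys" "$tmp/id_rsa"
```

SSH refuses to use a private key that other users can read, which is why the 600 mode on id_rsa matters.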

   > Gather the network information for the head node
 # echo -e "`hostname -i`\t`hostname -f`\th1"
   > Print to stdout the IP address, private DNS address, and h1 (host alias), using echo with the special character '\t' to delimit the values with tabs
   > save this line for the head node h1 to the previously created notes file as it will be used later to populate the /etc/hosts file of each instance
   > You should be able to ssh from the head node h1 into the other instances (n1,n2,n3) now that we have uploaded the .pem as id_rsa and set permissions
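The same hosts line can be composed step by step, which makes each piece easier to check. A small sketch; the `ip`, `fqdn`, and `line` variable names are just for illustration, and 127.0.0.1 is a fallback for machines where `hostname -i` fails:

```shell
# Build an /etc/hosts line for this machine: IP <tab> FQDN <tab> alias (here h1),
# the same shape the echo -e command above prints.
ip=$(hostname -i 2>/dev/null | awk '{print $1}')
fqdn=$(hostname -f 2>/dev/null || hostname)
line=$(printf '%s\t%s\t%s' "${ip:-127.0.0.1}" "$fqdn" "h1")
echo "$line"
```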

 # ssh
   > Get the public address for node1/n1 that we copied in the previous steps and connect to this instance through the head node using SSH
   > You will notice the ssh tool asking you to authenticate the host; since this is T&D, let's remove this prompt so we can ssh automatically into all of the nodes
 # sed -i 's/^.*StrictHostKeyChecking.*$/StrictHostKeyChecking=no/' /etc/ssh/ssh_config ; service sshd restart
   > This uses the sed command to search /etc/ssh/ssh_config and replace any line containing the StrictHostKeyChecking variable, setting it equal to 'no'
   > The next command will restart the ssh daemon such that it can pick up this configuration file change
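Before touching the live /etc/ssh/ssh_config you can dry-run the same sed substitution on a scratch copy; the two-line sample config below is made up for the demonstration:

```shell
# Dry-run the sed edit on a scratch copy (never the live /etc/ssh/ssh_config).
cfg=$(mktemp)
printf '%s\n' 'Host *' '#   StrictHostKeyChecking ask' > "$cfg"

# Same substitution as above: rewrite any line mentioning StrictHostKeyChecking.
sed -i 's/^.*StrictHostKeyChecking.*$/StrictHostKeyChecking=no/' "$cfg"
grep StrictHostKeyChecking "$cfg"   # -> StrictHostKeyChecking=no
```

Note the pattern matches the commented-out default line too, which is exactly why it works on a stock ssh_config.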
 # ssh
   > You should now be logged into node1/n1 through its public DNS address. Let's grab its network information, including IP address and private DNS address, and name it n1
 # echo -e "`hostname -i`\t`hostname -f`\tn1"
    > save this line for node1/n1 to the previously created notes file

 # exit
   > exit your ssh connection to n1 and return to the head node h1

   > REPEAT: the steps above to ssh into n2 and n3 using their public DNS addresses. Grab their network information (IP address, private DNS address), and don't forget to change the alias to n2, n3

 # vi /etc/hosts
 # KEYBOARD INPUT=lowercase: o
   > Set up the hosts file for the head node h1
   > VI lowercase o: go into insert mode and add a new line below the current line
   > Paste the network configuration for h1, n1, n2, n3, one host per line, in IP\tPrivateDNS\talias format
   > VI: exit insert mode (press ESC)
   > VI: save and exit (type :wq and press Enter)

   > You should now be able to ssh into the alias of each machine without having to re-copy the public/private dns address each time, try it:
 # ssh n1
 # exit
   > REPEAT: the steps above to ssh into each node n1, n2, n3 and vi each instance's /etc/hosts file, adding the 4 lines for each host in the cluster
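For reference, the finished /etc/hosts should look roughly like this. The 10.0.0.x addresses and ip-*.ec2.internal names below are made-up placeholders (use the lines you collected earlier), and the sketch writes to a scratch file rather than the real /etc/hosts; tabs or spaces both work as delimiters in a hosts file:

```shell
# Example shape of the finished /etc/hosts -- placeholder addresses and names.
hosts=$(mktemp)
cat > "$hosts" <<'EOF'
10.0.0.10   ip-10-0-0-10.ec2.internal   h1
10.0.0.11   ip-10-0-0-11.ec2.internal   n1
10.0.0.12   ip-10-0-0-12.ec2.internal   n2
10.0.0.13   ip-10-0-0-13.ec2.internal   n3
EOF
cat "$hosts"
```

Every node in the cluster gets the same 4 lines, which is what makes `ssh n1` work from anywhere.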

   > Let's make sure we have the standard CentOS mirrorlist of repositories for yum to pull from. We are going to want to install pdsh, so let's grab the AMBARI repo as well.

 # rpm -Uvh
   > Install the Hortonworks Data Platform (HDP) repository file  

 # yum install -y pdsh
   > Install pdsh (from HDP repo) and automatically answer yes to installation questions
 # vi /etc/pdsh/machines
   > Add the private DNS addresses for n1, n2, n3 here, one per line, with no leading or trailing lines or spaces. Do not include the head node h1, as we are executing from that node
 # pdsh -a whoami
   > Execute the command 'whoami' across all nodes defined in the /etc/pdsh/machines file (in our case n1, n2, n3). It should return root for each machine, and you now have access to all nodes


 # CTRL + R, then type: rpm
   > Use bash reverse history search (CTRL+R) to recall the earlier rpm command
 # pdsh -a "rpm -Uvh " | dshbak
   > Execute the rpm command through pdsh to install the AMBARI repository definition file across all nodes
   > Pipe the output of the pdsh command through dshbak, which groups and formats the output per node

 # pdsh -a 'ls /etc/yum.repos.d/' | dshbak
   > List the yum repository definition directory to verify the CentOS and HDP repository definition file on each node.

You have now setup the nodes and are ready to start installing the software for Hadoop!

Command History & Descriptions

  • Followed steps in last 2 guides "Setting Up EC2" & "Setting Up Linux"
  • 4 Node instances on Amazon EC2 with correct SSH settings, hosts, and Repository Definitions

 > Here we will check the nodes in the cluster for any pre-installed software that may conflict with our Hadoop installation.

 # rpm -qa | grep -ie ruby -ie passenger -ie nagios -ie ganglia -ie puppet -ie rrdtool -ie mysql
  > Check for existing software installations
 # yum erase -y ruby* rrdtool* epel*
  > Remove existing installations of ruby, rrdtool, and epel packages from the head node
 # pdsh -a "yum erase -y ruby* rrdtool* epel*"
  > Remove the same packages from node 1, node 2, node 3

 # service ntpd status
  > Check the ntp daemon status. It should come back with an 'unrecognized service' error, signifying that ntp is not installed.
 # yum install -y ntp
  > Install the ntp service on the head node
 # pdsh -a "yum install -y ntp"
  > Install the ntp service on N1, N2, N3
 # chkconfig ntpd on ; chkconfig iptables off
  > Configure which services run at startup: set ntpd to start at boot, and set the iptables firewall to be stopped at boot.
 # pdsh -a "chkconfig ntpd on ; chkconfig iptables off"
  > Repeat the command above across N1, N2, N3
 # pdsh -a reboot
 # reboot
  > Reboot all the nodes in the cluster. This makes sure all settings take effect, and is often a best practice before starting, to flush out any issues that may be present.

(wait a couple of minutes for the servers to come back online)

 # service ntpd status ; service iptables status
  > Check the status of each service: ntpd should be running, and iptables should be stopped
 # pdsh -a "service ntpd status ; service iptables status" | dshbak
  > Run the same check across all other nodes in the cluster.

  > We will install the Ambari software on the head node to prepare the node cluster for the Hadoop installation.

 # yum install -y epel-release
 # pdsh -a "yum install -y epel-release"
  > Install the EPEL repository on the nodes. This gives access to dependent software necessary for the Ambari installation.
 # yum install -y ambari-server
  > Install AMBARI on the head node


 # ambari-server setup
 # ambari-server start
  > This will start the Ambari service and download any Java dependencies.

  > Grab the Public DNS address to your Head node H1 and navigate your web browser to the following address

  Login, username & password = admin

  1.  Name your cluster, I named mine 'seanc'

Installation Options

    In this section make sure to include the private fully qualified address for each node in the cluster (remember # hostname -f) and the private key file (.pem) being used.


Choose Services
  1. Leave all services checked and click 'Select Services'

Assign Masters / Assign Slaves & Clients

    Leave this section to the default settings

Select Installation Points

    Use the default directories

Customize Services
  1. Nagios Tab
    1. Fill-in Nagios Admin Password & Email
  2. Hive/HCat Tab
    1. Fill-in MySQL Password

  3. Click 'Finished customizing all components'
  4. Review Settings
  5. Click 'Deploy'

 (The process takes ~20 minutes on 4 medium Amazon EC2 instances)
(Nagios may send some 'error' emails. This is okay)

You have now created and installed a 4-node Hadoop cluster. Congratulations!

Note: if you are having issues, make sure you check the configuration before deploying. This is what my config looked like:

Admin Name : admin

Cluster Name : seanc

Total Hosts : 3 (3 new)

Local Repository : No


      NameNode : ip-10-110-162-58.ec2.internal
      SecondaryNameNode : ip-10-110-159-124.ec2.internal
      DataNodes : 1 hosts
      JobTracker : ip-10-110-159-124.ec2.internal
      TaskTrackers : 1 hosts
      Server : ip-10-110-162-58.ec2.internal
      Administrator : nagiosadmin / (
      Server : ip-10-110-162-58.ec2.internal
    Hive + HCatalog
      Hive Metastore : ip-10-110-159-124.ec2.internal
      Database : MySQL (New Database)
      Master : ip-10-110-162-58.ec2.internal
      Region Servers : 1 hosts
      Server : ip-10-110-159-124.ec2.internal
      Servers : 3 hosts