Install Hive on Hadoop

Installed Hive 0.14 on a 3-node Hadoop cluster.

1. Create a separate EC2 instance for running Hive, and download a stable release, hive-x.y.z.tar.gz.

    Keep in mind that the security group (firewall) settings on the Hadoop namenode & job tracker must be configured to allow access from the Hive server.

    Normally, master:54311 (job tracker) and hdfs://master:54310 (HDFS namenode) should be reachable from the Hive server.

    Set up the hduser:hadoop user & usergroup.
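Before going further, it can help to confirm from the Hive instance that both ports are actually reachable. A minimal sketch using bash's built-in /dev/tcp pseudo-device (so no extra tools are needed); the hostname "master" and the two ports are the ones assumed above:

```shell
#!/usr/bin/env bash
# Check whether host:port accepts a TCP connection, using bash's /dev/tcp
# redirection. Prints one status line per check.
check_port() {
  if (echo > "/dev/tcp/$1/$2") 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 NOT reachable"
  fi
}

check_port master 54310   # HDFS namenode
check_port master 54311   # job tracker
```

If either port shows NOT reachable, fix the security group rules before continuing.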

    

2. Unpack the gz file and move the hive folder to /usr/local.

   tar -xzf hive-x.y.z.tar.gz

   sudo mv hive-x.y.z /usr/local/hive

   sudo chown -R hduser:hadoop /usr/local/hive

   

   Also copy the hadoop folder to /usr/local/hadoop. There is no need to configure Hadoop here: Hive simply uses the hadoop command to communicate with the Hadoop cluster. The Hive server is not part of the Hadoop cluster; it runs on top of it.

3. Set HIVE_HOME, HADOOP_HOME and JAVA_HOME in .bashrc.

   export HADOOP_HOME=/usr/local/hadoop

   export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

   export HIVE_HOME=/usr/local/hive

   export PATH=$HIVE_HOME/bin:$PATH
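After editing .bashrc, reload it and sanity-check that the variables took effect. A quick check (the paths match the layout used in this walkthrough; adjust if yours differs):

```shell
#!/usr/bin/env bash
# Same exports as in .bashrc above.
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
export HIVE_HOME=/usr/local/hive
export PATH="$HIVE_HOME/bin:$PATH"

# Confirm $HIVE_HOME/bin landed on PATH, so "hive" resolves without
# typing the full /usr/local/hive/bin/hive path.
case ":$PATH:" in
  *":$HIVE_HOME/bin:"*) echo "PATH ok: $HIVE_HOME/bin is on PATH" ;;
  *)                    echo "PATH problem: $HIVE_HOME/bin missing" ;;
esac
```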

   

4. SSH to the Hadoop master node and create the /tmp and warehouse folders on HDFS. Make sure the Hadoop instance has been started.

   /hadoop/bin/hadoop dfs -mkdir /tmp

   /hadoop/bin/hadoop dfs -mkdir /user/hive/warehouse 

   /hadoop/bin/hadoop dfs -chmod g+w /tmp

   /hadoop/bin/hadoop dfs -chmod g+w /user/hive/warehouse 

   

   The default warehouse location is /user/hive/warehouse (set by the hive.metastore.warehouse.dir property).

      

5. Go back to the Hive server. Create hive-site.xml in /usr/local/hive/conf and set the mapred.job.tracker and HDFS namenode (fs.default.name) properties.

   

<configuration>

 <property>

  <name>mapred.job.tracker</name>

  <value>master:54311</value>

 </property>

 <property>

  <name>fs.default.name</name>

  <value>hdfs://master:54310</value>

 </property>

</configuration>

    Hive gets the access information for the Hadoop cluster from here. Other properties can be set up here as well.

6. At this point, everything is in place to run a Hive instance.

  /usr/local/hive/bin/hive

7. Test HiveQL. By default, Hive delimits the fields within a row with Ctrl-A (\001). Here we use "," instead.

Prepare a test.txt file as follows (no space after the comma, otherwise the string column would be loaded with a leading blank):

  1,hello

  2,world

  3,good

  4,day
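One way to create the file exactly as intended; OUT defaults to the current directory here, so change it to /home/hive/test.txt on the Hive server to match the load command below:

```shell
#!/usr/bin/env bash
# Write the sample rows. Note: no space after the comma, otherwise the
# string column picks up a leading blank when Hive parses the file.
OUT=${OUT:-test.txt}
printf '%s\n' '1,hello' '2,world' '3,good' '4,day' > "$OUT"
cat "$OUT"
```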

Create a table and specify the column delimiter as ",".

hive> create table test (a int, b string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

Verify that the table was created:

hive> show tables;

hive> describe test;  

Load the test.txt file into the table and select the content. Without the 'local' keyword, Hive will load from HDFS.

hive> load data local inpath '/home/hive/test.txt' overwrite into table test;

hive> select * from test;

NOTE: HDFS starts in safe mode and normally leaves it automatically once enough blocks have reported in. While it is in safe mode, Hive cannot load files to HDFS; leave safe mode manually first:

hadoop dfsadmin -safemode leave  

8. Have a look at HDFS. 

  hadoop dfs -ls /user/hive/warehouse/test

  

  Hive creates a new folder 'test' under 'warehouse' to store the raw txt file. Hive maps the table schema (metadata) onto the raw file so users can run SQL-like queries over the raw file.
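This "schema-on-read" idea can be sketched outside Hive: the raw file never changes, and the table definition is just a parsing rule applied at query time. Here awk stands in for Hive's row deserializer:

```shell
#!/usr/bin/env bash
# A raw comma-delimited file, exactly as Hive stores it in the warehouse dir.
printf '1,hello\n2,world\n' > raw.txt

# "Apply the schema" at read time: split on "," and name the columns
# like the (a int, b string) table above.
awk -F',' '{ printf "a=%s b=%s\n", $1, $2 }' raw.txt
# prints:
#   a=1 b=hello
#   a=2 b=world
```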

In the above example, Hive's metadata is stored locally in a Derby database (plain local files), which supports only one session at a time.

To support multiple concurrent sessions, configure a MySQL database to store the metadata instead.
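A hive-site.xml fragment for a MySQL-backed metastore would look roughly like this; the hostname, database name, and credentials are placeholders, and the MySQL JDBC driver jar must also be placed in $HIVE_HOME/lib:

```xml
<property>
 <name>javax.jdo.option.ConnectionURL</name>
 <value>jdbc:mysql://metastore-host/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
 <name>javax.jdo.option.ConnectionDriverName</name>
 <value>com.mysql.jdbc.Driver</value>
</property>
<property>
 <name>javax.jdo.option.ConnectionUserName</name>
 <value>hiveuser</value>
</property>
<property>
 <name>javax.jdo.option.ConnectionPassword</name>
 <value>hivepassword</value>
</property>
```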