Big Data Tutorial: Hive

NOTICE: This page refers to the Peel Hadoop Cluster which was retired in 2023 and is in the process of being archived. Its contents should not be referenced for Dataproc Courses.

What is Apache Hive? 

Apache Hive is a data warehouse software that facilitates querying and managing large datasets residing in distributed storage (for example, HDFS). Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Queries written in the Hive Query Language (HiveQL or HQL) are compiled into MapReduce jobs that process the structured data.
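For instance, a HiveQL query reads much like ordinary SQL. The query below is purely illustrative (a similar messages table is built later in this tutorial):

```sql
-- Count posts per country in a hypothetical messages table
SELECT country, COUNT(*) AS posts
FROM messages
GROUP BY country;
```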

It also provides:

What is Hadoop?

Hadoop is an open-source software framework for storing and processing big data in a distributed/parallel fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing. The core Hadoop consists of HDFS and Hadoop's implementation of MapReduce.
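The MapReduce model itself can be sketched in a few lines of Python. This is a toy, single-process illustration of the programming model (word count), not Hadoop's actual API: map emits (key, value) pairs, the framework groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this between phases)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a single result
    return {key: sum(values) for key, values in groups.items()}

data = ["big data on hadoop", "big clusters process big data"]
counts = reduce_phase(shuffle(map_phase(data)))
print(counts)  # counts["big"] == 3
```

In Hadoop the map and reduce functions run in parallel across the cluster, and the shuffle moves data between machines; the logic per record is the same.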

What is HDFS? 

HDFS stands for Hadoop Distributed File System. HDFS is a highly fault-tolerant file system and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

What is Peel?

Peel is the stand-alone Hadoop cluster running Cloudera CDH version 6.3.4. Cloudera Enterprise (CDH) combines Apache Hadoop 3.0.0 and Apache Spark 2.4.0 with a number of other open-source projects to create a single, massively scalable system in which you can unite storage with an array of powerful processing and analytic frameworks.

Steps to connect to Peel on Windows

Steps to connect to Peel on Mac

For more information, please follow the instructions on this link

Components of Hive:

HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools — including Pig and MapReduce — to more easily read and write data on the grid.

WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, and Hive jobs, or to perform Hive metadata operations, using an HTTP (REST-style) interface.

Hive is not...

Data Types on Hive

All the data types in Hive are classified into four types, given as follows:

Basic Commands in HiveQL:

Accessing Hive

There are two ways to access Hive:

Using the Hive CLI

Command Line Functions and MySQL Equivalent Functions

Using Beeline:

The Hive CLI is deprecated; migrating to Beeline is recommended.

-bash-4.1$ beeline
 
beeline> !connect jdbc:hive2://hm-1.hpc.nyu.edu:10000/
Enter username for jdbc:hive2://hm-1.hpc.nyu.edu:10000/: <net_id>
Enter password for jdbc:hive2://hm-1.hpc.nyu.edu:10000/: **********
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> use <net_id>;

Hive Statements:

User databases are pre-created for users on Peel. The following query is used to verify the list of databases you have access to:

0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> show databases;

Now let's try an example:

Before we start working with the examples, please copy the Tutorial2 directory from '/scratch/work/public/peel/tutorials/Tutorial2/' to your home directory:

cp -r /scratch/work/public/peel/tutorials/Tutorial2/ $HOME
cd $HOME/Tutorial2
hdfs dfs -put $HOME/Tutorial2/user_posts.txt /user/<net_id>/

Example 1:

create table messages (users STRING, post STRING, time BIGINT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
describe <table_name>;

or

describe extended <table_name>; 
load data inpath 'hdfs://horton.hpc.nyu.edu:8020/user/<net_id>/user_posts.txt' overwrite into table messages;
select count(*) from messages;
select * from messages where country='UK';
select * from messages where users = 'user1';
select * from messages where users = 'user1' or users = 'user2';
select distinct users from messages;
select count(distinct users) from messages;
select * from messages order by time;
select users, count(users) from messages group by users;
drop table <table_name>; 

With Hive we are operating on the Apache Hadoop data store. Any query you write, any table you create, and any data you copy persists from query to query. You can think of Hive as a data workbench where you can examine, modify, and manipulate the data in Apache Hadoop. When we perform a data processing task, we execute it one query or line at a time. Once a line executes successfully, you can inspect the data objects to verify that the last operation did what you expected. All your data is live. This flexibility is Hive's strength: you can solve a problem bit by bit and change your mind about what to do next depending on what you find.

Hive only points to the data on the HDFS file system, though there is also the option of using "local inpath". Hive stores only the structure (schema) of the table, not the data itself; the data is always read from HDFS or from the local machine.
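For reference, loading from the local file system instead of HDFS uses the LOCAL keyword; the path below is illustrative:

```sql
-- Copies the file from the local file system into the table's HDFS location
LOAD DATA LOCAL INPATH '/home/<net_id>/Tutorial2/user_posts.txt'
OVERWRITE INTO TABLE messages;
```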

Example 2: External Tables

> CREATE EXTERNAL TABLE messages2 (users STRING, post STRING, time BIGINT, country STRING)
COMMENT 'Bigdata-Learning'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/<net_id>/messages2';
hadoop fs -copyFromLocal /home/$USER/Tutorial2/user_posts.txt /user/$USER/messages2
-bash-4.1$ beeline
0: jdbc:hive2://babar.es.its.nyu.edu:10000/> select * from messages2;

PARTITIONING

Example 3: Static Partition

Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date, city, and department. Using partition, it is easy to query a portion of the data.

create table messages3 (users STRING, post STRING, time BIGINT) PARTITIONED BY (country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; 
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> INSERT INTO TABLE messages3 PARTITION(country='US') SELECT users, post, time FROM messages WHERE country='US';
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> INSERT INTO TABLE messages3 PARTITION(country='IND') SELECT users, post, time FROM messages WHERE country='IND';
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> INSERT INTO TABLE messages3 PARTITION(country='UK') SELECT users, post, time FROM messages WHERE country='UK';
cp -r /scratch/work/public/peel/tutorials/Tutorial2/can_posts.txt /home/$USER/
hdfs dfs -put $HOME/Tutorial2/can_posts.txt /user/$USER/
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> load data inpath 'hdfs://horton.hpc.nyu.edu:8020/user/<net_id>/can_posts.txt' overwrite into table messages3 PARTITION (country='CAN');
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> select * from messages3;
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> select * from messages3 where country='CAN' or country='IND';
hadoop fs -ls /user/hive/warehouse/<net_id>.db/messages3
hadoop fs -cat /user/hive/warehouse/<net_id>.db/messages3/country=IND/000000_0
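Hive can also list a table's partitions directly, without touching the warehouse directory; for the messages3 table above:

```sql
-- List the partitions Hive tracks for the table
SHOW PARTITIONS messages3;
```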

NOTE: With the above approach, if we have 50 partitions we need to run the INSERT statement 50 times. That is tedious, and this approach is known as Static Partitioning.

Example 4: Dynamic Partition

create table messages4 (users STRING, post STRING, time BIGINT) PARTITIONED BY (country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; 
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> SET hive.exec.dynamic.partition = true;
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> SET hive.exec.dynamic.partition.mode = nonstrict;
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> FROM messages
INSERT OVERWRITE TABLE messages4 PARTITION(country)
SELECT users, post, time, country
DISTRIBUTE BY country;
select * from messages4;
select * from messages4 where country = 'UK';
select * from messages4 limit 5;
select count(*) from messages4;

Example 5: Bucketing

Tables or partitions can be sub-divided into buckets, which give the data extra structure that can be used for more efficient querying. Bucketing works on the value of a hash function applied to a column of the table.
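Conceptually, a row lands in bucket hash(column) mod numBuckets, so identical column values always end up in the same bucket file. The Python sketch below illustrates this with a Java String.hashCode-style hash; this is an assumption for illustration only, as Hive's exact hash function varies by column type and version:

```python
def java_string_hashcode(s: str) -> int:
    # Java-style String.hashCode: h = 31*h + char, as a signed 32-bit int
    # (illustrative assumption; Hive's actual hash may differ)
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_for(value: str, num_buckets: int = 3) -> int:
    # Non-negative remainder picks one of the num_buckets bucket files
    return (java_string_hashcode(value) & 0x7FFFFFFF) % num_buckets

rows = ["US", "IND", "UK", "CAN", "US", "UK"]
buckets = {b: [v for v in rows if bucket_for(v) == b] for b in range(3)}
print(buckets)  # identical values always land in the same bucket
```

Because each bucket is a separate file, queries like TABLESAMPLE (used below) can read just one bucket instead of scanning the whole table.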

0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> set hive.enforce.bucketing;
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> set hive.enforce.bucketing=true;
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> set hive.enforce.bucketing;
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> create table messages5 (users STRING, post STRING, time BIGINT, country STRING)
clustered by (country) into 3 buckets
row format delimited fields terminated by ",";
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> FROM messages
INSERT into table messages5
SELECT users,post,time,country;
select * from messages5;
select * from messages5 TABLESAMPLE (BUCKET 1 OUT OF 3);
select * from messages5 TABLESAMPLE (BUCKET 2 OUT OF 3);
select * from messages5 TABLESAMPLE (BUCKET 3 OUT OF 3);