Big Data Tutorial: Hive
NOTICE: This page refers to the Peel Hadoop Cluster which was retired in 2023 and is in the process of being archived. Its contents should not be referenced for Dataproc Courses.
What is Apache Hive?
Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage (for example, HDFS). Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL (the Hive Query Language, or HQL); Hive translates these queries into MapReduce jobs for execution.
It also provides:
Tools to enable easy data extract/transform/load (ETL)
A mechanism to impose structure on a variety of data formats
Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
Query execution via MapReduce
What is Hadoop?
Hadoop is an open-source software framework for storing and processing big data in a distributed/parallel fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing. The core Hadoop consists of HDFS and Hadoop's implementation of MapReduce.
What is HDFS?
HDFS stands for Hadoop Distributed File System. HDFS is a highly fault-tolerant file system and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
What is Peel?
Peel is the stand-alone Hadoop cluster running Cloudera CDH version 6.3.4. Cloudera Enterprise (CDH) combines Apache Hadoop 3.0.0 and Apache Spark 2.4.0 with a number of other open-source projects to create a single, massively scalable system where you can unite storage with an array of powerful processing and analytic frameworks.
Steps to connect on Windows
Connect to NYU VPN
Use hostname "peel.hpc.nyu.edu" with port 22 in PuTTY and provide your credentials.
Steps to connect on Mac
Connect to NYU VPN
Then, from a terminal, run "ssh <net_id>@peel.hpc.nyu.edu". This will connect you to the Peel cluster.
For more information, please follow the instructions on this link
Components of Hive:
HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools — including Pig and MapReduce — to more easily read and write data on the grid.
WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, and Hive jobs, or to perform Hive metadata operations, using an HTTP (REST style) interface.
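As an illustration, WebHCat endpoints can be reached with plain HTTP; the hostname below is hypothetical (50111 is WebHCat's default port):
# Check WebHCat server status (hostname is hypothetical)
curl -s 'http://webhcat-host.example.com:50111/templeton/v1/status'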
Hive is not...
A relational database
Designed for Online Transaction Processing (OLTP)
A language for real-time queries and row-level updates.
Data Types on Hive
All the data types in Hive are classified into four types, given as follows:
Column Types: Integrals (INT), String (CHAR), TimeStamp, Dates, Decimals & Union.
Literals: Floating Point
Null Values: NULL
Complex Types: Arrays, Maps & Structs
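As an illustration, a single table declaration can mix these types; the table and column names below are hypothetical:
-- Hypothetical table mixing Hive column types and complex types
CREATE TABLE employee_demo (
  id      INT,
  name    STRING,
  salary  DECIMAL(10,2),
  hired   DATE,
  skills  ARRAY<STRING>,
  phones  MAP<STRING,STRING>,
  address STRUCT<street:STRING, city:STRING, zip:STRING>
);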
Basic Commands in HiveQL:
Accessing Hive
There are two ways to access Hive:
For the Hive shell (Beeline): type “beeline” on Peel
For Hue UI: Provides a Hive interface -> http://demo.gethue.com/#
Using the Hive Shell (Beeline)
Command Line Functions and MySQL Equivalent Functions
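The original comparison table is not reproduced here; a few representative pairs (both sides are standard commands):
show databases;          -- MySQL: SHOW DATABASES;
show tables;             -- MySQL: SHOW TABLES;
describe <table_name>;   -- MySQL: DESCRIBE <table_name>;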
Using Beeline:
Hive CLI is deprecated and migration to Beeline is recommended.
-bash-4.1$ beeline
beeline> !connect jdbc:hive2://hm-1.hpc.nyu.edu:10000/
Enter password for jdbc:hive2://hm-1.hpc.nyu.edu:10000/: **********
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> use <net_id>;
Hive Statements:
User databases are pre-created for users on Peel. The following query is used to verify the list of databases you have access to:
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> show databases;
Now let's try an example:
Before we start working with the examples, please copy the Tutorial2 directory from '/scratch/work/public/peel/tutorials/Tutorial2/' to your home directory:
cp -r /scratch/work/public/peel/tutorials/Tutorial2/ $HOME
cd $HOME/Tutorial2
hdfs dfs -put $HOME/Tutorial2/user_posts.txt /user/<net_id>/
Example 1:
Create a table "messages" with columns user, post and time.
Use 'describe' to display the list of columns in the table:
describe <table_name>;
or
describe extended <table_name>;
Load data into the Hive table.
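A sketch of the load, assuming the file was uploaded to HDFS as shown earlier:
-- note: INPATH moves the file from its source HDFS location into the table
LOAD DATA INPATH '/user/<net_id>/user_posts.txt' INTO TABLE messages;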
Usage of the 'select' statement.
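For example:
select * from messages limit 5;
select count(*) from messages;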
Deleting a table. NOTE: Do not delete the table 'messages', as its data is used in the next example.
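The syntax is:
drop table <table_name>;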
With Hive we are operating on the Apache Hadoop data store. Any query you write, any table you create, and any data you copy persists from query to query. You can think of Hive as providing a data workbench where you can examine, modify, and manipulate the data in Apache Hadoop. When we perform a data processing task, we execute it one query or one line at a time. Once a line executes successfully, you can look at the data objects to verify that the last operation did what you expected. All your data is live. This kind of flexibility is Hive's strength: you can solve problems bit by bit and change your mind about what to do next depending on what you find.
Hive only points to the data on the HDFS file system, but there is also the option of using "LOCAL INPATH" to load from the local machine. Hive stores only the structure (metadata) of the table, not the data itself; the data is always read from HDFS or from the local machine.
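A sketch of a local load, assuming your copy of Tutorial2 still sits in your home directory on Peel:
-- LOCAL INPATH reads from the local file system instead of HDFS
LOAD DATA LOCAL INPATH '/home/<net_id>/Tutorial2/user_posts.txt' INTO TABLE messages;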
Example 2: External Tables
Create an external table "messages2" with columns user, post and time.
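A sketch, reusing the schema assumptions from Example 1 and pointing the table at the HDFS location used below:
CREATE EXTERNAL TABLE messages2 (
  `user` STRING,
  post   STRING,
  `time` BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/<net_id>/messages2';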
Copy input data manually to HDFS location '/user/<net_id>/messages2' as given below.
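For example, from the Peel login shell:
hdfs dfs -mkdir -p /user/<net_id>/messages2
hdfs dfs -put $HOME/Tutorial2/user_posts.txt /user/<net_id>/messages2/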
Now you can query the table 'messages2' from Hive.
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> select * from messages2;
PARTITIONING
Example 3: Static Partition
Hive organizes tables into partitions, a way of dividing a table into related parts based on the values of partitioning columns such as date, city, and department. Using partitions, it is easy to query a portion of the data.
Create a partitioned Hive table. Here we are creating a partition for 'country' by using the PARTITIONED BY clause.
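A sketch, assuming the same three data columns as before (the table name 'messages3' comes from the queries below):
CREATE TABLE messages3 (
  `user` STRING,
  post   STRING,
  `time` BIGINT
)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';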
Load data from table 'messages' which was created in the previous example.
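A sketch; the partition value 'US' is an illustrative choice:
INSERT OVERWRITE TABLE messages3 PARTITION (country = 'US')
SELECT `user`, post, `time` FROM messages;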
Alternate way to load data into the table.
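For instance, loading a file directly into a partition (path and partition value are illustrative):
LOAD DATA INPATH '/user/<net_id>/user_posts.txt' INTO TABLE messages3 PARTITION (country = 'UK');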
Now you can query the table 'messages3' from Hive.
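For example:
select * from messages3 where country = 'US';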
Alternate way to see the data, i.e., raw data display.
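Since each partition is just a directory in HDFS, you can inspect its files directly; the path below assumes the default Hive warehouse layout and your pre-created user database:
hdfs dfs -cat /user/hive/warehouse/<net_id>.db/messages3/country=US/*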
NOTE: With the above approach, if we have 50 partitions we need to run the insert statement 50 times. That is a tedious task, and this approach is known as Static Partitioning.
Example 4: Dynamic Partition
Static Partition columns: In DML/DDL involving multiple partitioning columns, the columns whose values are known at COMPILE TIME (given by user).
Dynamic Partition columns: Columns whose values are only known at EXECUTION TIME.
Create a partitioned Hive table. Here we are creating the partition for 'country' by using the PARTITIONED BY clause.
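A sketch, identical in shape to 'messages3' (the table name 'messages4' comes from the queries below):
CREATE TABLE messages4 (
  `user` STRING,
  post   STRING,
  `time` BIGINT
)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';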
In order to achieve dynamic partitioning, we need to execute the SET commands below in Hive.
set hive.exec.dynamic.partition=true : This enables dynamic partitions, by default it is false.
set hive.exec.dynamic.partition.mode=nonstrict : We are using dynamic partitioning without a static partition (a table can be partitioned based on multiple columns in Hive), and in that case we have to enable non-strict mode. In strict mode, we can use a dynamic partition only together with a static partition.
Load the data into the partitioned table 'messages4' from table 'messages' which was created in previous examples.
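A sketch; this assumes the source table exposes a country value, and note that the partition column must come last in the SELECT:
-- the partition column (country) must be the last column selected
INSERT OVERWRITE TABLE messages4 PARTITION (country)
SELECT `user`, post, `time`, country FROM messages;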
Now you can query the table 'messages4'.
select * from messages4 where country = 'UK';
select * from messages4 limit 5;
select count(*) from messages4;
Example 5: Bucketing
Tables or partitions are sub-divided into buckets to give the data extra structure that can be used for more efficient querying. Bucketing works based on the value of a hash function of some column of the table.
In Hive, bucketing is not enforced by default. You will have to set the following variable to enable bucketing:
set hive.enforce.bucketing=true;
Create a bucketed table.
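A sketch; the bucketing column (`user`) and bucket count (4) are illustrative choices, while the four columns come from the insert below:
CREATE TABLE messages5 (
  `user`  STRING,
  post    STRING,
  `time`  BIGINT,
  country STRING
)
CLUSTERED BY (`user`) INTO 4 BUCKETS;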
Load data into the bucketed table 'messages5' from table 'messages4' created in the previous example.
INSERT INTO TABLE messages5
SELECT `user`, post, `time`, country FROM messages4;
Now you can query the table 'messages5'.
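For example, bucketing lets you sample a single bucket efficiently (the bucket count matches the sketch above):
select * from messages5 limit 5;
select * from messages5 tablesample(bucket 1 out of 4 on `user`);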