Big Data Tutorial: Hive
NOTICE: This page refers to the Peel Hadoop Cluster which was retired in 2023 and is in the process of being archived. Its contents should not be referenced for Dataproc Courses.
What is Apache Hive?
Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage (for example, HDFS). Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL (the Hive Query Language, or HQL); Hive translates these queries into MapReduce jobs for execution.
It also provides:
Tools to enable easy data extract/transform/load (ETL)
A mechanism to impose structure on a variety of data formats
Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
Query execution via MapReduce
What is Hadoop?
Hadoop is an open-source software framework for storing and processing big data in a distributed/parallel fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing. The core Hadoop consists of HDFS and Hadoop's implementation of MapReduce.
What is HDFS?
HDFS stands for Hadoop Distributed File System. HDFS is a highly fault-tolerant file system and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
What is Peel?
Peel is the stand-alone Hadoop cluster running Cloudera CDH version 6.3.4. Cloudera Enterprise (CDH) combines Apache Hadoop 3.0.0 and Apache Spark 2.4.0 with a number of other open-source projects to create a single, massively scalable system where you can unite storage with an array of powerful processing and analytic frameworks.
Steps to connect on Windows
Connect to NYU VPN
Use hostname "peel.hpc.nyu.edu" with port 22 in PuTTY and provide your credentials.
Steps to connect on Mac
Connect to NYU VPN
Then, from a terminal, run "ssh <net_id>@peel.hpc.nyu.edu". This will connect you to the Peel cluster.
For more information, please follow the instructions on this link
Components of Hive:
HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools — including Pig and MapReduce — to more easily read and write data on the grid.
WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, and Hive jobs, or to perform Hive metadata operations, using an HTTP (REST style) interface.
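As an illustration, WebHCat endpoints can be reached with plain HTTP; the hostname below is hypothetical (50111 is WebHCat's default port):
# Check WebHCat server status (hostname is hypothetical)
curl -s 'http://webhcat-host.example.com:50111/templeton/v1/status'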
Hive is not...
A relational database
Designed for Online Transaction Processing (OLTP)
A language for real-time queries and row-level updates.
Data Types on Hive
All the data types in Hive are classified into four types, given as follows:
Column Types: Integrals (INT), String (CHAR), TimeStamp, Dates, Decimals & Union.
Literals: Floating Point
Null Values: NULL
Complex Types: Arrays, Maps & Structs
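As an illustration, a single table declaration can mix these types; the table and column names below are hypothetical:
-- Hypothetical table mixing Hive column types and complex types
CREATE TABLE employee_demo (
  id      INT,
  name    STRING,
  salary  DECIMAL(10,2),
  hired   DATE,
  skills  ARRAY<STRING>,
  phones  MAP<STRING,STRING>,
  address STRUCT<street:STRING, city:STRING, zip:STRING>
);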
Basic Commands in HiveQL:
Accessing Hive
There are two ways to access Hive:
For the Hive shell (Beeline): type “beeline” on Peel
For Hue UI: Provides a Hive interface -> http://demo.gethue.com/#
Using the Hive Shell (Beeline)
Command Line Functions and MySQL Equivalent Functions
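The original comparison table is not reproduced here; a few representative pairs (both sides are standard commands):
show databases;          -- MySQL: SHOW DATABASES;
show tables;             -- MySQL: SHOW TABLES;
describe <table_name>;   -- MySQL: DESCRIBE <table_name>;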
Using Beeline:
Hive CLI is deprecated and migration to Beeline is recommended.
-bash-4.1$ beeline
beeline> !connect jdbc:hive2://hm-1.hpc.nyu.edu:10000/
Enter password for jdbc:hive2://hm-1.hpc.nyu.edu:10000/: **********
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> use <net_id>;
Hive Statements:
User databases are pre-created for users on Peel. The following query is used to verify the list of databases you have access to:
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> show databases;
Now let's try an example:
Before we start working with the examples, please copy the Tutorial2 directory from '/scratch/work/public/peel/tutorials/Tutorial2/' to your home directory:
cp -r /scratch/work/public/peel/tutorials/Tutorial2/ $HOME
cd $HOME/Tutorial2
hdfs dfs -put $HOME/Tutorial2/user_posts.txt /user/<net_id>/
Example 1:
Create a table "messages" with columns user, post and time.
Use 'describe' to display the list of columns in the table:
describe <table_name>;
or
describe extended <table_name>;
Load data into the Hive table.
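A sketch of the load, assuming the file was uploaded to HDFS as shown earlier:
-- note: INPATH moves the file from its source HDFS location into the table
LOAD DATA INPATH '/user/<net_id>/user_posts.txt' INTO TABLE messages;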
Usage of the 'select' statement.
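For example:
select * from messages limit 5;
select count(*) from messages;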
Deleting a table. NOTE: Do not delete the table 'messages', as its data is used in the next example.
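The syntax is:
drop table <table_name>;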
With Hive we are operating on the Apache Hadoop data store. Any query you write, any table you create, and any data you copy persists from query to query. You can think of Hive as providing a data workbench where you can examine, modify, and manipulate the data in Apache Hadoop. When we perform a data processing task, we execute it one query or one line at a time. Once a line executes successfully, you can look at the data objects to verify that the last operation did what you expected. All your data is live. This kind of flexibility is Hive's strength: you can solve problems bit by bit and change your mind about what to do next depending on what you find.
Hive only points to the data on the HDFS file system, but there is also the option of using "LOCAL INPATH" to load from the local machine. Hive stores only the structure (metadata) of the table, not the data itself; the data is always read from HDFS or from the local machine.
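A sketch of a local load, assuming your copy of Tutorial2 still sits in your home directory on Peel:
-- LOCAL INPATH reads from the local file system instead of HDFS
LOAD DATA LOCAL INPATH '/home/<net_id>/Tutorial2/user_posts.txt' INTO TABLE messages;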
Example 2: External Tables
Create an external table "messages2" with columns user, post and time.
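A sketch, reusing the schema assumptions from Example 1 and pointing the table at the HDFS location used below:
CREATE EXTERNAL TABLE messages2 (
  `user` STRING,
  post   STRING,
  `time` BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/<net_id>/messages2';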
Copy input data manually to HDFS location '/user/<net_id>/messages2' as given below.
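For example, from the Peel login shell:
hdfs dfs -mkdir -p /user/<net_id>/messages2
hdfs dfs -put $HOME/Tutorial2/user_posts.txt /user/<net_id>/messages2/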
Now you can query the table 'messages2' from Hive.
0: jdbc:hive2://hm-1.hpc.nyu.edu:10000/> select * from messages2;
PARTITIONING
Example 3: Static Partition
Hive organizes tables into partitions, a way of dividing a table into related parts based on the values of partitioning columns such as date, city, and department. Using partitions, it is easy to query a portion of the data.
Create a partitioned Hive table. Here we are creating a partition for 'country' by using the PARTITIONED BY clause.
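A sketch, assuming the same three data columns as before (the table name 'messages3' comes from the queries below):
CREATE TABLE messages3 (
  `user` STRING,
  post   STRING,
  `time` BIGINT
)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';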
Load data from table 'messages' which was created in the previous example.
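A sketch; the partition value 'US' is an illustrative choice:
INSERT OVERWRITE TABLE messages3 PARTITION (country = 'US')
SELECT `user`, post, `time` FROM messages;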
Alternate way to load data into the table.
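For instance, loading a file directly into a partition (path and partition value are illustrative):
LOAD DATA INPATH '/user/<net_id>/user_posts.txt' INTO TABLE messages3 PARTITION (country = 'UK');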
Now you can query the table 'messages3' from Hive.
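For example:
select * from messages3 where country = 'US';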
Alternate way to see the data, i.e., raw data display.
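Since each partition is just a directory in HDFS, you can inspect its files directly; the path below assumes the default Hive warehouse layout and your pre-created user database:
hdfs dfs -cat /user/hive/warehouse/<net_id>.db/messages3/country=US/*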
NOTE: With the above approach, if we have 50 partitions we need to run the insert statement 50 times. That is a tedious task, and this approach is known as Static Partitioning.
Example 4: Dynamic Partition
Static Partition columns: In DML/DDL involving multiple partitioning columns, the columns whose values are known at COMPILE TIME (given by user).
Dynamic Partition columns: Columns whose values are only known at EXECUTION TIME.
Create a partitioned Hive table. Here we are creating the partition for 'country' by using the PARTITIONED BY clause.
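A sketch, identical in shape to 'messages3' (the table name 'messages4' comes from the queries below):
CREATE TABLE messages4 (
  `user` STRING,
  post   STRING,
  `time` BIGINT
)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';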
In order to achieve dynamic partitioning, we need to execute the SET commands below in Hive.
set hive.exec.dynamic.partition=true : This enables dynamic partitions, by default it is false.
set hive.exec.dynamic.partition.mode=nonstrict : We are using dynamic partitioning without a static partition (a table can be partitioned based on multiple columns in Hive), and in that case we have to enable non-strict mode. In strict mode, we can use a dynamic partition only together with a static partition.
Load the data into the partitioned table 'messages4' from table 'messages' which was created in previous examples.
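A sketch; this assumes the source table exposes a country value, and note that the partition column must come last in the SELECT:
-- the partition column (country) must be the last column selected
INSERT OVERWRITE TABLE messages4 PARTITION (country)
SELECT `user`, post, `time`, country FROM messages;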
Now you can query the table 'messages4'.
select * from messages4 where country = 'UK';
select * from messages4 limit 5;
select count(*) from messages4;
Example 5: Bucketing
Tables or partitions are sub-divided into buckets to give the data extra structure that can be used for more efficient querying. Bucketing works based on the value of a hash function of some column of the table.
In Hive, bucketing is not enforced by default. You will have to set the following variable to enable bucketing:
set hive.enforce.bucketing=true;
Create a bucketed table.
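A sketch; the bucketing column (`user`) and bucket count (4) are illustrative choices, while the four columns come from the insert below:
CREATE TABLE messages5 (
  `user`  STRING,
  post    STRING,
  `time`  BIGINT,
  country STRING
)
CLUSTERED BY (`user`) INTO 4 BUCKETS;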
Load data into the bucketed table 'messages5' from table 'messages4' created in the previous example.
INSERT INTO TABLE messages5
SELECT `user`, post, `time`, country FROM messages4;
Now you can query the table 'messages5'.
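For example, bucketing lets you sample a single bucket efficiently (the bucket count matches the sketch above):
select * from messages5 limit 5;
select * from messages5 tablesample(bucket 1 out of 4 on `user`);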