BIG DATA - Hive
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It allows users to write queries in an SQL-like language called HiveQL (or HQL). It resides on top of Hadoop to summarize Big Data, and it makes querying and analysis easy. It is a platform used to develop SQL-type scripts to perform MapReduce operations.
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Hive efficiently converts queries into MapReduce tasks at the backend.
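To make the features above concrete, here is a minimal HiveQL session; the table and column names are hypothetical, chosen only for illustration:

```sql
-- Create a simple managed table (hypothetical names)
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- A familiar SQL-style query; Hive compiles this into MapReduce tasks behind the scenes
SELECT name, salary
FROM employees
WHERE salary > 50000;
```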
All the data types in Hive are classified into four categories, given as follows:
Column Types
Literals
Null Values
Complex Types
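The column types and complex types can be sketched in a table definition like the following (the table and field names are hypothetical):

```sql
-- Hypothetical table mixing primitive column types and complex types
CREATE TABLE user_profile (
  id            INT,                               -- primitive column type
  phone_numbers ARRAY<STRING>,                     -- complex type: array
  properties    MAP<STRING, STRING>,               -- complex type: map
  address       STRUCT<city:STRING, zip:STRING>    -- complex type: struct
);
```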
Hive organizes tables into partitions: a way of dividing a table into related parts based on the values of partition columns such as date, city, and department. Using partitions, it is easy to query only a portion of the data.
What are Partitions
Hive partitioning organizes tables by dividing them into different parts based on partition keys.
Partitioning is helpful when the table has one or more partition keys. Partition keys are the basic elements that determine how the data is stored in the table.
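A minimal sketch of partitioning in HiveQL, assuming a hypothetical sales table and input path:

```sql
-- Hypothetical table partitioned by date and city
CREATE TABLE sales (
  order_id INT,
  amount   DOUBLE
)
PARTITIONED BY (order_date STRING, city STRING);

-- Load data into one partition (the input path is an assumption)
LOAD DATA LOCAL INPATH '/tmp/sales_data.csv'
INTO TABLE sales
PARTITION (order_date = '2023-01-01', city = 'London');

-- Filtering on the partition columns reads only that partition's directory
SELECT * FROM sales
WHERE order_date = '2023-01-01' AND city = 'London';
```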
What are Buckets
Buckets in Hive are used to segregate table data into multiple files or directories, which enables more efficient querying.
The data present in a partition can be divided further into buckets.
The division is performed based on a hash of a particular column selected in the table.
Buckets use a hashing algorithm at the back end to read each record and place it into a bucket.
In Hive, bucketing must be enabled with SET hive.enforce.bucketing = true;
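The bucketing setup above can be sketched as follows; the table, column names, and bucket count are hypothetical:

```sql
-- Enable bucket enforcement (needed on older Hive versions;
-- newer versions enforce bucketing by default)
SET hive.enforce.bucketing = true;

-- Hypothetical table: each country partition is split into 4 buckets;
-- rows are assigned to a bucket by hashing the user_id column
CREATE TABLE users_bucketed (
  user_id INT,
  name    STRING
)
PARTITIONED BY (country STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS;
```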
https://www.guru99.com/hive-partitions-buckets-example.html
MSCK REPAIR for Hive External Tables
When the partition directories still exist in HDFS but are missing from the metastore, run a metastore check with the repair table option:
hive> MSCK REPAIR TABLE <db_name>.<table_name>;
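A short worked example of the repair flow, assuming a hypothetical external logs table and HDFS location:

```sql
-- Hypothetical external table whose partition directories live in HDFS
CREATE EXTERNAL TABLE logs (
  msg STRING
)
PARTITIONED BY (dt STRING)
LOCATION '/data/logs';

-- Suppose the directory /data/logs/dt=2023-01-01/ was written directly to HDFS;
-- MSCK REPAIR TABLE registers the missing partition in the metastore
MSCK REPAIR TABLE logs;

-- The recovered partition is now visible to queries
SHOW PARTITIONS logs;
```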