Apache Hive

The Apache Hive [10] data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. HiveQL also lets traditional MapReduce programmers plug in their own custom mappers and reducers when it is inconvenient or inefficient to express that logic in HiveQL itself.

Interactive Job

Example: Create a grades table, inspect its columns, and drop it [11]

Start Hive, then execute the following commands at the hive prompt.

hive

hive> CREATE TABLE grades (name STRING, grade FLOAT);

OK

Time taken: 2.462 seconds

hive> SHOW TABLES;

OK

grades

Time taken: 0.213 seconds, Fetched: 1 row(s)

hive> DESCRIBE grades;

OK

name                    string                                      

grade                   float                                       

Time taken: 0.189 seconds, Fetched: 2 row(s)

hive> DROP TABLE grades;

OK

Time taken: 0.366 seconds

hive> SHOW TABLES;

OK

Time taken: 0.024 seconds

hive> quit;
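The same statements can also be run without the interactive prompt. The following is a minimal sketch, assuming the hive client is on your PATH. The file name grades.hql and the helper write_grades_hql are hypothetical names chosen for illustration. It relies on the hive -f option, which executes the statements in a file, and adds IF NOT EXISTS / IF EXISTS so that the script can be re-run without errors.

```shell
#!/usr/bin/env bash
# Write the HiveQL statements from the interactive session to a file.
# write_grades_hql and grades.hql are illustrative names, not part of Hive.
write_grades_hql() {
    cat > "$1" <<'EOF'
CREATE TABLE IF NOT EXISTS grades (name STRING, grade FLOAT);
SHOW TABLES;
DESCRIBE grades;
DROP TABLE IF EXISTS grades;
SHOW TABLES;
EOF
}

write_grades_hql grades.hql

# Run the file only when the hive client is available on this machine.
if command -v hive >/dev/null 2>&1; then
    hive -f grades.hql
else
    echo "hive not found on PATH; wrote grades.hql only" >&2
fi
```

Keeping the statements in a file makes the session repeatable and easy to submit as part of a batch job.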

Batch Job

Example: MovieLens User Ratings [11]

Copy the ml-100k data from /home/sxg125/hadoop-projects/hive and change into the ml-100k directory:

cp -r /home/sxg125/hadoop-projects/hive/ml-100k .

cd <path-to-ml-100k>

Modify the script "movieLens.sh", which is also in the ml-100k directory, so that it uses the path to your data, as shown:

....

#Load the data

hive -e "LOAD DATA LOCAL INPATH '<path-to-ml-100k>/u.data' OVERWRITE INTO TABLE u_data";

...

Run the bash script:

./movieLens.sh

View the output:

cat output

Output:

100000

1       13278

2       14816

3       15426

4       13774

5       17964

6       12318

7       12424
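As a quick sanity check, the seven per-key counts above add up to the figure on the first line (100000). The sketch below automates that comparison. It assumes the output file has the layout shown here: a single number on the first line, followed by whitespace-separated key and count pairs. The function name check_counts is hypothetical.

```shell
# check_counts FILE: compare line 1 with the sum of column 2 on later lines.
check_counts() {
    awk '
        NR == 1 { total = $1; next }      # first line: overall count
        NF >= 2 { sum += $2 }             # later lines: key and count
        END {
            if (sum == total) { print "OK: " sum; exit 0 }
            print "MISMATCH: total=" total " sum=" sum
            exit 1
        }
    ' "$1"
}
```

Run it as check_counts output. With the listing above it prints "OK: 100000". A mismatch would suggest that the data was only partially loaded or that the file has a different layout.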