Apache Hive
The Apache Hive [10] data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. HiveQL also lets traditional MapReduce programmers plug in their own custom mappers and reducers when it is inconvenient or inefficient to express that logic in HiveQL itself.
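As a rough illustration of the plug-in mechanism, the sketch below writes a tiny mapper script and dry-runs it locally without Hive. It assumes Hive's default streaming behaviour, in which TRANSFORM passes each row to the script's standard input as tab-separated fields and reads tab-separated rows back from its standard output. The script name upper_name.sh and the table grades_demo are hypothetical, chosen only for this example.

```shell
# Hypothetical mapper: upper-cases the first tab-separated column of each row.
cat > upper_name.sh <<'EOF'
#!/bin/sh
awk 'BEGIN { FS = OFS = "\t" } { $1 = toupper($1); print }'
EOF
chmod +x upper_name.sh

# Local dry run: feed one tab-separated row on stdin, as Hive would.
printf 'alice\t3.5\n' | ./upper_name.sh

# To plug the script into a query, one would register it and call it
# through TRANSFORM at the hive prompt (grades_demo is a hypothetical table):
#   ADD FILE upper_name.sh;
#   SELECT TRANSFORM (name, grade) USING 'sh upper_name.sh' AS (name, grade)
#   FROM grades_demo;
```

Testing the mapper on a few hand-written rows before handing it to Hive is usually quicker than debugging it inside a running job.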
Interactive Job
Example: Create a grades table, list its columns, and delete it [11]
Run Hive and execute the following commands at the hive prompt.
hive
hive> CREATE TABLE grades (name STRING, grade FLOAT);
OK
Time taken: 2.462 seconds
hive> SHOW TABLES;
OK
grades
Time taken: 0.213 seconds, Fetched: 1 row(s)
hive> DESCRIBE grades;
OK
name string
grade float
Time taken: 0.189 seconds, Fetched: 2 row(s)
hive> DROP TABLE grades;
OK
Time taken: 0.366 seconds
hive> SHOW TABLES;
OK
Time taken: 0.024 seconds
hive> quit;
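The same statements can also be run without the interactive prompt. The sketch below saves them to a file and passes it to the hive launcher with its -f option; the file name grades.hql is hypothetical, and the guard simply skips the run on machines where hive is not on the PATH.

```shell
# Save the HiveQL statements from the interactive example to a script file.
cat > grades.hql <<'EOF'
CREATE TABLE grades (name STRING, grade FLOAT);
SHOW TABLES;
DESCRIBE grades;
DROP TABLE grades;
EOF

# Run the script only if the hive launcher is available.
if command -v hive >/dev/null 2>&1; then
  hive -f grades.hql
else
  echo "hive not found on PATH; skipping run" >&2
fi
```

Keeping the statements in a file makes the session repeatable and easy to submit as part of a batch script.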
Batch Job
Example: MovieLens User Ratings [11]
Copy the ml-100k data directory from /home/sxg125/hadoop-projects/hive and cd into it:
cp -r /home/sxg125/hadoop-projects/hive/ml-100k .
cd <path-to-ml-100k>
Modify the script "movieLens.sh", which is also in the ml-100k directory, so that it uses the path to your data, as shown:
....
#Load the data
hive -e "LOAD DATA LOCAL INPATH '<path-to-ml-100k>/u.data' OVERWRITE INTO TABLE u_data";
...
Run the bash script
./movieLens.sh
View the output:
cat output
Output:
100000
1 13278
2 14816
3 15426
4 13774
5 17964
6 12318
7 12424
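A quick sanity check on this result can be sketched as follows. It assumes, based only on the listing above, that the first line is the total row count and that each following line is a group key followed by that group's count, so the group counts should add up to the total. The function name check_output is hypothetical.

```shell
# Hypothetical helper: compare the total on line 1 with the sum of the
# per-group counts (second column) on the remaining lines.
check_output() {
  awk 'NR == 1 { total = $1; next }
       { sum += $2 }
       END {
         if (sum == total) {
           printf "OK: %d rows\n", sum
         } else {
           printf "MISMATCH: groups sum to %d, total is %d\n", sum, total
           exit 1
         }
       }' "$1"
}

# Check the file written by movieLens.sh, if it is present.
if [ -f output ]; then
  check_output output
fi
```

The helper returns a non-zero exit status on a mismatch, so it can also be used as a final step in a batch script.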