2 on cluster: a program uses collaborative filtering to determine what this user will be interested in.
3 on webpage: show the top 10 results ranked by predicted likelihood.
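As a rough illustration of the last step, the sketch below picks the 10 highest-scoring items from a table of predicted likelihoods. The function name top_n, the business IDs, and the scores are all made up for the example; the real scores would come from the collaborative filtering job on the cluster.

```python
import heapq


def top_n(predictions, n=10):
    """Return the n (item, score) pairs with the highest predicted score."""
    return heapq.nlargest(n, predictions.items(), key=lambda item: item[1])


# Hypothetical predicted likelihoods for one user (made-up values).
predicted = {f"business_{i}": i / 20 for i in range(15)}

top10 = top_n(predicted)
for business_id, score in top10:
    print(business_id, score)
```

heapq.nlargest avoids sorting the whole table when only the first few results are shown on the page.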
Project – Yelp data analysis and recommender system
Use MapReduce to process the “Review.json” file, use Pig to process the “business.json” file, and use Hive to process “business.json” as well, exporting the result to HBase from Hive by creating “mapping tables”. Under the hood, Hive and Pig both run as MapReduce jobs.
Keywords: MapReduce, Hive, Pig, HBase, ETL.
Objectives:
Design a MapReduce (MR) Java program to load data into Hadoop.
Extract the two fields “text” and “stars”, and add a label: positive (if stars is over 3) or negative (if stars is under or equal to 3). These items are then used for machine learning to judge the “accuracy & truth” of a review and to decide which restaurants to recommend to customers.
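The labeling rule above can be sketched in a few lines. This is a Python illustration of the logic only, not the project's Java MapReduce code; the function name label_review and the sample record are assumptions, while the field names “text” and “stars” come from the description.

```python
import json


def label_review(line):
    """Parse one JSON review record and return (text, stars, label)."""
    record = json.loads(line)
    stars = record["stars"]
    # Over 3 stars counts as positive; 3 or fewer counts as negative.
    label = "positive" if stars > 3 else "negative"
    return record["text"], stars, label


# Hypothetical sample record (not taken from the Yelp data).
sample = '{"text": "Great tacos.", "stars": 4}'
print(label_review(sample))
```

In a MapReduce job, the same rule would sit inside the map function, with one review record per input line.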
Step 1 Upload the data and the MapReduce jar file to the cluster and create an input directory.
Step 2 Run the command “hadoop jar <jar file name> <main class> <input directory> <output directory>”.
Key in code:
Step 1 Upload the data and import the necessary jars, such as piggybank, json, and so on.
Step 2 Process the JSON file. The JSON file has no schema and is stored row by row, so use Foreach “file name” generate to convert each field to chararray. Then replace \\n with “,” so that the “address” displays in one row, and apply the same operation to the other columns. Finally, store the “new file” in a directory on HDFS.
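To show what the newline replacement does to a record, here is a small Python sketch of the same cleanup. It is not the Pig script itself; the function names and the sample business record are assumptions, and only the “address” field name comes from the description.

```python
import json


def flatten_field(value):
    """Replace embedded newlines with commas so the value fits on one row."""
    return value.replace("\n", ",")


def flatten_record(line):
    """Apply the newline cleanup to every string field of one JSON record."""
    record = json.loads(line)
    return {
        key: flatten_field(value) if isinstance(value, str) else value
        for key, value in record.items()
    }


# Hypothetical sample record with a multi-line address.
sample = '{"name": "Cafe X", "address": "1 Main St\\nSuite 2\\nPhoenix, AZ", "stars": 4.5}'
print(flatten_record(sample)["address"])
```

Non-string fields such as “stars” pass through unchanged, which mirrors applying the replacement only to text columns.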
Step 3 Now we can work with the data. Use the “split” command to divide the data into different categories and the “dump” command to get the result.
This project showed me that Pig is a “lazy” language, because not every command triggers execution. For example, “dump” and “store” trigger execution, but Foreach…Generate does not run by itself. On the other hand, Pig makes loading data easy because it comes with many libraries, such as piggybank.jar. However, Pig is not well suited to querying, because it has limited query optimization.
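The “lazy” behavior can be imitated with a Python generator. This is only an analogy under my own assumptions: Pig builds an execution plan rather than using generators, and the names foreach_generate, relation, and dumped are invented for the example.

```python
executed = []  # records which rows were actually processed


def foreach_generate(rows):
    """Lazy transformation: nothing runs until a consumer asks for rows."""
    for row in rows:
        executed.append(row)
        yield row.upper()


relation = foreach_generate(["a", "b"])  # only describes the work
ran_before_dump = list(executed)         # snapshot: nothing has run yet
dumped = list(relation)                  # plays the role of "dump": forces execution
print(ran_before_dump, dumped)
```

Defining the transformation costs nothing; the rows are only touched when something consumes the result, which is the role “dump” and “store” play in Pig.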
Step 1 Add hive-serdes.jar, which is used to read JSON files, and load the data.
Step 2 Create the HBase table from Hive. Create a column family for all the items and match the data types.
Step 3 Create an external table. (When you drop an external table, only the table definition is removed and the underlying data stays in place, so other users can still use it. When you drop an internal table, the data is deleted as well.) Then match a data type to each column and store this table.
Step 4 Interact with the data (from source_hive_table insert into table my_hbase_table).
Note: Data cannot be loaded directly into HBase from Hive; it has to be inserted from a source Hive table. This is because Hive is a data warehouse, not a database: it is all about files, and the tables do not exist as storage of their own.
Hive, Pig, and MapReduce are all widely used batch-processing tools, and Pig and Hive run as MapReduce jobs too. Compared to MapReduce, HQL has a huge user base and is easy to code and easy to query with. MapReduce code has to follow the M/R model, which makes it hard to reuse and leads to long development time. In addition, Hive and Pig are able to integrate with HBase. For these reasons Hive is a very popular batch-processing tool in industry.