This project calculates the average rating of each movie genre and counts the tags applied to each genre, for each year.
The project is written in Java and uses Spark HBase and the Spark JavaRDD API.
The dataset is the MovieLens 20M Dataset, which contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users.
The first step loads three CSV files and joins them on a shared key, movieID. The joined dataset is then saved to HBase.
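The join on movieID can be illustrated with a minimal plain-Java sketch. It is not the project's Spark code: it uses in-memory maps in place of RDDs, and it assumes the three files are the MovieLens movies, ratings and tags files with the column orders noted in the comments. The class name MovieJoinSketch, the FusedMovie type and the "value@timestamp" encoding are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: joins movie, rating and tag lines on movieID in memory.
class MovieJoinSketch {

    /** One fused record per movie: its genres plus every rating and tag that references it. */
    static final class FusedMovie {
        final String genres;                            // e.g. "Comedy|Drama"
        final List<String> ratings = new ArrayList<>(); // "rating@timestamp"
        final List<String> tags = new ArrayList<>();    // "tag@timestamp"
        FusedMovie(String genres) { this.genres = genres; }
    }

    /** Inputs are CSV data lines with the header row already removed (assumption). */
    static Map<Integer, FusedMovie> fuse(List<String> movieLines,
                                         List<String> ratingLines,
                                         List<String> tagLines) {
        Map<Integer, FusedMovie> byMovie = new HashMap<>();

        // Assumed layout: movieId,title,genres. Titles may contain commas,
        // so take the id before the first comma and the genres after the last one.
        for (String line : movieLines) {
            int first = line.indexOf(',');
            int last = line.lastIndexOf(',');
            int movieId = Integer.parseInt(line.substring(0, first));
            byMovie.put(movieId, new FusedMovie(line.substring(last + 1)));
        }

        // Assumed layout: userId,movieId,rating,timestamp.
        for (String line : ratingLines) {
            String[] f = line.split(",");
            FusedMovie m = byMovie.get(Integer.parseInt(f[1]));
            if (m != null) {
                m.ratings.add(f[2] + "@" + f[3]);
            }
        }

        // Assumed layout: userId,movieId,tag,timestamp. Tags may contain commas,
        // so the tag is everything between the second and the last comma.
        for (String line : tagLines) {
            int first = line.indexOf(',');
            int second = line.indexOf(',', first + 1);
            int last = line.lastIndexOf(',');
            int movieId = Integer.parseInt(line.substring(first + 1, second));
            FusedMovie m = byMovie.get(movieId);
            if (m != null) {
                m.tags.add(line.substring(second + 1, last) + "@" + line.substring(last + 1));
            }
        }
        return byMovie;
    }
}
```

Ratings and tags whose movieID has no matching movie line are dropped here, which corresponds to an inner join on the movie side. Whether the project keeps or drops such rows is not stated, so treat that choice as an assumption too.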
The second step loads the dataset from HBase and computes the required results. Each genre's average rating and tag count for each year are then printed and saved as a text file.
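The per-genre, per-year aggregation can be sketched in plain Java as follows. This is an illustration of the arithmetic only, not the project's Spark job. It assumes that "year" means the UTC year of the rating or tag timestamp (Unix seconds) and that a movie's genres are pipe-separated, so one rating or tag counts toward every genre of its movie. The class name GenreYearStatsSketch and the "genre,year" key format are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: accumulates rating sums/counts and tag counts per "genre,year" key.
class GenreYearStatsSketch {

    /** UTC calendar year of a Unix timestamp given in seconds. */
    static int yearOf(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC).getYear();
    }

    /** Adds one rating to the running {sum, count} pair of every genre of the movie. */
    static void addRating(Map<String, double[]> acc, String genres, double rating, long timestamp) {
        for (String genre : genres.split("\\|")) {
            double[] sumAndCount = acc.computeIfAbsent(genre + "," + yearOf(timestamp), k -> new double[2]);
            sumAndCount[0] += rating; // running sum of ratings
            sumAndCount[1] += 1;      // running number of ratings
        }
    }

    /** Adds one tag application to the counter of every genre of the movie. */
    static void addTag(Map<String, Long> acc, String genres, long timestamp) {
        for (String genre : genres.split("\\|")) {
            acc.merge(genre + "," + yearOf(timestamp), 1L, Long::sum);
        }
    }

    /** Turns the {sum, count} pairs into averages, sorted by key for stable output. */
    static Map<String, Double> averages(Map<String, double[]> acc) {
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            out.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return out;
    }
}
```

Keeping a sum and a count per key, and dividing only at the end, is the usual way to compute an average in a distributed setting, because partial sums and counts from different partitions can be merged in any order while partial averages cannot.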
CONCLUSION: Both parts are implemented with Java Spark and run on the cluster. The program takes around 4 minutes to process the data and output the results.