Recommendation Problems

Groups:

Group 1 (xena01-03) - Fordin, Sarah; Bomer, Dan; Yang, Mary

Group 2 (xena04-06) - Koeller, Jordan; Burnett, Jesse; Skogman, Brett

Group 3 (xena07-09) - Whitten, Marcus; Andres, Robbie; Viltoft, Jorgen

Group 4 (xena10-12) - Croxton, John; Walker, Blair; Reyes, Miguel

Group 5 (xena13-15) - Holloway, Taylor; Samoray, Nicholas; Herbert, Emily

Group 6 (xena16-18) - Chang, Stephen; Burton, Craig; Newton, Michael

Group 7 (xena19-21) - Usiri, Calvin; Witecki, Ian; Ang, Sam

Data Set:

This week's data set comes from the Netflix Prize (data and description at https://www.kaggle.com/netflix-inc/netflix-prize-data). The data comes from a competition Netflix ran to get people to develop a better movie recommendation system than the one it had in place at the time. You can find a copy of the data set in /data/BigData/Netflix/. Note that the Kaggle hosting of this data set consolidated the files: instead of 17,770 separate files, one for each movie, this data set has four large files that contain the same information. Note that for our analysis, we are going to ignore the dates to keep things simpler.
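In the consolidated files, each movie's block begins with a header line like `1:` followed by `userId,rating,date` lines. A minimal parsing sketch for that format (the `Rating` case class and `parseRatings` name are my own, not part of the assignment):

```scala
// Hypothetical helper for the combined_data file format: a header line
// "movieId:" followed by "userId,rating,date" lines for that movie.
case class Rating(movieId: Int, userId: Int, rating: Double)

def parseRatings(lines: Iterator[String]): Iterator[Rating] = {
  var movieId = -1  // carries the most recent header's movie ID
  lines.flatMap { line =>
    if (line.endsWith(":")) {
      movieId = line.dropRight(1).toInt
      None // header line produces no rating
    } else {
      val fields = line.split(",")
      // fields(2) is the date, which we ignore for this analysis
      Some(Rating(movieId, fields(0).toInt, fields(1).toDouble))
    }
  }
}
```

Because the result is an iterator, you can stop reading once the movie ID passes your cutoff instead of parsing the whole file.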

Remember that because this is explicit rating data (users gave actual star ratings), you need to call setImplicitPrefs(false) on the ALS object.
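A sketch of that configuration, assuming a ratings DataFrame with userId, movieId, and rating columns (the column names and parameter values here are assumptions, not requirements):

```scala
import org.apache.spark.ml.recommendation.ALS

// Configure ALS for explicit feedback; the Netflix data is 1-5 star ratings.
val als = new ALS()
  .setImplicitPrefs(false)   // explicit ratings, not implicit preferences
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setMaxIter(10)            // illustrative values; tune as needed
  .setRegParam(0.1)
```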

Also, because this data set is large and needs to be loaded locally before being pushed out to Spark, you might need to increase the memory available to the JVM. You can give the JVM more memory in sbt with the command export SBT_OPTS="-Xmx8g". You can replace "8g" with something bigger, but it needs to fit inside the available memory of the machine you are running on. For submitting to the cluster with spark-submit, you can use the --driver-memory and --executor-memory options. For example, you might include "--driver-memory 8g --executor-memory 8g" in your call to spark-submit. These can also be specified as options to the Spark session builder in your program. If you are using Eclipse, you can increase the maximum heap size by giving the JVM argument "-Xmx8g".
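The memory settings above, collected in one place (the jar path is a placeholder for your own assembly):

```shell
# Give sbt's JVM a larger heap; adjust 8g to fit your machine.
export SBT_OPTS="-Xmx8g"

# Request more driver and executor memory when submitting to the cluster.
spark-submit --driver-memory 8g --executor-memory 8g path/to/your-assembly.jar
```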

Note that in addition to only reading some of the large files, you can limit the size of the processing by only reading in a certain number of movies. I was able to do basic processing on my laptop with the first 1000 movies using 8 GB of memory in Eclipse.

In Class Questions:

Do these problems for movies with IDs less than 1000.

1. What is the range of user IDs?

2. How many distinct user IDs are there?

3. How many five-star ratings has user 372233 given?

4. Which movie has the most user ratings? (Give both the number and the title.)

5. Which movie has the most five-star user ratings? (Give both the number and the title.)
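Questions of this kind map onto standard DataFrame aggregations. A sketch, assuming `ratings` and `titles` DataFrames with the column names shown (all of the names here are my assumptions, not part of the assignment):

```scala
import org.apache.spark.sql.functions._
// Assumes a SparkSession named spark is in scope.
import spark.implicits._

// Range of user IDs and number of distinct users.
ratings.select(min($"userId"), max($"userId")).show()
ratings.select(countDistinct($"userId")).show()

// Five-star ratings given by a particular user.
val fives = ratings.filter($"userId" === 372233 && $"rating" === 5).count()

// Most-rated movie, joined with titles to report both number and title.
ratings.groupBy($"movieId").count()
  .join(titles, "movieId")
  .orderBy(desc("count"))
  .show(1)
```

Adding a `filter($"rating" === 5)` before the groupBy handles the five-star variant of the last query.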

Before you leave class, one member of your group needs to send me an email with your group answers to these questions and the code you wrote to solve them. Make sure the email also includes the names of all the group members who were present to work on this.

Between Class Questions:

All the code that you write to answer these questions should be put in a package called sparkml in the in-class repository. You should also make a file called sparkml-recommend.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

Note that while you are doing your development and testing, you can use a smaller section of the data set. I would recommend movie IDs less than 1000 and user IDs less than 10,000. For your final answers, use movie IDs less than 5000 and user IDs less than 100,000. I got that to work with a run time of less than 10 minutes using the cluster and setting both driver-memory and executor-memory to 8g. Note that those same settings also handled 10,000 movies with 100,000 users, so I feel confident you can do the smaller number of movies.

1. Make top-5 movie recommendations for each user. What are the titles of the 10 most commonly recommended movies, and how many times was each one recommended?
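One way to approach this, sketched with Spark ML's recommendForAllUsers. It assumes a configured ALS instance `als`, a `training` DataFrame of ratings, and a `titles` DataFrame keyed by movieId (all names are my assumptions):

```scala
import org.apache.spark.sql.functions._
// Assumes a SparkSession named spark is in scope.
import spark.implicits._

val model = als.fit(training)

// recommendForAllUsers returns one row per user with a "recommendations"
// column: an array of (movieId, rating) structs.
val top5 = model.recommendForAllUsers(5)

// Flatten the recommended movie IDs, count how often each appears,
// and attach titles.
val counts = top5
  .select(explode($"recommendations.movieId").as("movieId"))
  .groupBy("movieId").count()
  .join(titles, "movieId")
  .orderBy(desc("count"))

counts.show(10)
```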

2. You can check the quality of your movie recommendations by splitting your data, training on the larger set, and then comparing the predicted ratings to the ratings users actually gave in the held-out set. The example code for Collaborative Filtering shows you how to do this with a RegressionEvaluator.
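The split-and-evaluate step can be sketched as follows, assuming a `ratings` DataFrame and an ALS estimator `als` configured as described above (the names and the 80/20 split are my assumptions):

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Hold out 20% of the ratings for evaluation.
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

// "drop" discards NaN predictions for users/movies unseen in training,
// so they don't poison the RMSE.
val model = als.setColdStartStrategy("drop").fit(training)
val predictions = model.transform(test)

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")

println(s"RMSE = ${evaluator.evaluate(predictions)}")
```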