A recommender system is a system
- That attempts to recommend products a user is likely to enjoy.
- Recommendations are based on the user’s preference profile, extracted from their consumption/rating history, including
- The content of consumed items;
- The explicit ratings given to those items.
- The products could be movies from IMDb/Netflix, videos from YouTube, books from Amazon, music, etc.
- Although people’s tastes vary, they do follow patterns, which can be used to predict likes and dislikes.
– People tend to like things that are similar to other things they like.
– People tend to like things that similar people like.
- Recommendation is all about
– Estimating these patterns of taste, and
– Discovering new and desirable items that people didn’t already know about.
Non-personalized vs personalized
Non-personalized
– Recommended items are generated without considering a user’s rating/consuming history.
– TopPop (Top Popular): recommends items with the highest popularity (largest number of ratings).
– MovieAvg (Movie Average): recommends top-N items with the highest average rating.
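The two non-personalized scores can be illustrated on a tiny example. The sketch below uses a made-up 4-user by 4-item rating matrix (not MovieLens data) in which 0 means "not rated", and every toy item has at least one rating so the plain division is safe. The variable names are illustrative assumptions.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
# These numbers are made up for illustration only.
ratings = np.array([
    [5, 3, 0, 0],
    [4, 0, 0, 5],
    [1, 2, 3, 5],
    [4, 0, 0, 0],
])

# TopPop score: how many ratings each item received.
popularity = (ratings > 0).sum(axis=0)

# MovieAvg score: mean of the observed ratings of each item.
average = ratings.sum(axis=0) / popularity

# Rank items from best to worst under each score
# (a stable sort keeps tied items in index order).
top_pop_order = np.argsort(-popularity, kind="stable")
movie_avg_order = np.argsort(-average, kind="stable")

print("popularity:", popularity, "-> order", top_pop_order)
print("average:   ", average, "-> order", movie_avg_order)
```

In this toy data the two methods disagree: item 0 is the most popular, while item 3 has the highest average rating.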
How to do Recommendations? - Prem George Recommender Systems
Personalized
– Recommended items are generated based on a user’s rating/consuming history.
– Collaborative Filtering
– Content-based Methods
Let’s see an example of Recommendation with Python - Non-personalized Methods
Data From A Movie Recommender System
We use the data from https://grouplens.org/datasets/movielens/
MovieLens (100k)
– This data set consists of:
– 100,000 ratings (1-5) from 943 users on 1682 movies.
– Each user has rated at least 20 movies.
– Simple demographic info for the users (age, gender, occupation, zip)
• The data was collected
– through the MovieLens web site (movielens.umn.edu)
– during the seven-month period from September 19th, 1997 through April 22nd, 1998.
– This data has been cleaned up
– users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set.
MovieLens - Ratings and User Info
• u.data
– The full u data set, 100000 ratings by 943 users on 1682 items.
– Each user has rated at least 20 movies.
– Users and items are numbered consecutively from 1.
– The data is randomly ordered.
– This is a tab separated list of
– user id | item id | rating | timestamp.
– The time stamps are unix seconds since 1/1/1970 UTC
• u.info
– The number of users, items, and ratings in the u data set.
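As a small illustration of the u.data layout, the sketch below parses three made-up rows (not real MovieLens ratings) from an in-memory string, converts the Unix timestamps to dates, and derives 0-based indices from the 1-based ids. The column names `rated_at`, `user_idx`, and `item_idx` are assumptions introduced here.

```python
import io
import pandas as pd

# Three made-up rows in the u.data layout: tab-separated, no header line.
sample = (
    "1\t10\t4\t875000000\n"
    "1\t25\t5\t880000000\n"
    "2\t10\t3\t890000000\n"
)

names = ['user_id', 'item_id', 'rating', 'timestamp']
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', names=names)

# Timestamps are seconds since 1970-01-01 UTC.
sample_df['rated_at'] = pd.to_datetime(sample_df['timestamp'], unit='s')

# Ids are numbered from 1, so subtract 1 to get 0-based matrix indices.
sample_df['user_idx'] = sample_df['user_id'] - 1
sample_df['item_idx'] = sample_df['item_id'] - 1

print(sample_df)
```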
MovieLens - Items (Movies)
• u.item
– Information about the items (movies); this is a pipe-separated list (the fields in the file are delimited by |, which is why the loading code uses sep='|') of
movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western
– The last 19 fields are the genres, a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once.
– The movie ids are the ones used in the u.data data set.
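The 0/1 genre flags can be turned back into readable genre lists. The sketch below does this for two hypothetical movies (the titles and flags are made up, not taken from u.item), using the 19 genre names from the field list above.

```python
import pandas as pd

# The 19 genre columns, in the order given in the u.item field list.
genre_cols = ['unknown', 'Action', 'Adventure', 'Animation', "Children's",
              'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
              'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance',
              'Sci-Fi', 'Thriller', 'War', 'Western']

# Two hypothetical movies with all flags initially 0.
movies = pd.DataFrame(0, index=['Hypothetical Movie A', 'Hypothetical Movie B'],
                      columns=genre_cols)
movies.loc['Hypothetical Movie A', ['Animation', "Children's", 'Comedy']] = 1
movies.loc['Hypothetical Movie B', ['Drama']] = 1

# A movie belongs to every genre whose flag is 1, so it can have several genres.
genre_lists = {
    title: [g for g in genre_cols if flags[g] == 1]
    for title, flags in movies.iterrows()
}

print(genre_lists)
```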
MovieLens - Others
• u.genre
– A list of the genres.
• u.user
– Demographic information about the users;
– this is a pipe-separated list (fields delimited by |) of
user id | age | gender | occupation | zip code
– The user ids are the ones used in the u.data data set.
• u.occupation
– A list of the occupations.
Steps
1. Load the data
– u.data: ratings
– u.item: movie information
2. Convert data into user-item rating matrix
3. Calculate the popularity of movies
4. Calculate the average rating of movies
5. Recommend movies that the active user has not rated yet, using TopPop or MovieAvg.
6. Present the recommendations.
Load Packages
import numpy as np
import pandas as pd
Load the Rating Data
names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=names)
df.head()
Check the size of the data
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print(str(n_users) + ' users')
print(str(n_items) + ' items')
Construct the user-item binary ‘rating’ matrix: ratingsNum
ratingsNum = np.zeros((n_users, n_items))
for row in df.itertuples():
    # row[0] is the DataFrame index; row[1] is user_id and row[2] is item_id (both start at 1)
    ratingsNum[row[1]-1, row[2]-1] = 1
print(ratingsNum)
itemRateNumCurrent = ratingsNum.sum(axis=0)
itemRateNumCurrent.sort()
Plot Popularity Against Sorted Items
import matplotlib.pyplot as plt
plt.plot(itemRateNumCurrent[::-1])
plt.xlabel('sorted items') # adds label to x axis
plt.ylabel('popularity') # adds label to y axis
plt.show()
Construct the user-item rating matrix: ratings
ratings = np.zeros((n_users, n_items))
Convert Data Into User-Item Rating Matrix
for row in df.itertuples():
    # row[3] is the rating (1-5); unrated cells stay 0
    ratings[row[1]-1, row[2]-1] = row[3]
print(ratings)
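The loops above fill the matrices one rating at a time. As an alternative, the same matrices can be built with NumPy index arrays in a single assignment. The sketch below shows this on a made-up four-row ratings table (not MovieLens data) and checks it against the loop version. The names `toy`, `ratings_vec`, and `binary_vec` are assumptions, and the approach relies on ids being numbered consecutively from 1, as they are in u.data.

```python
import numpy as np
import pandas as pd

# Made-up ratings in the same long format as u.data (ids start at 1).
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'item_id': [1, 3, 2, 3],
    'rating':  [5, 3, 4, 1],
})
n_users = toy.user_id.max()
n_items = toy.item_id.max()

# Vectorized construction: one assignment using row and column index arrays.
rows = toy.user_id.to_numpy() - 1
cols = toy.item_id.to_numpy() - 1
ratings_vec = np.zeros((n_users, n_items))
ratings_vec[rows, cols] = toy.rating.to_numpy()

# The binary 'rated' matrix follows directly from the rating matrix.
binary_vec = (ratings_vec > 0).astype(float)

# Loop construction, as in the slides, for comparison.
ratings_loop = np.zeros((n_users, n_items))
for row in toy.itertuples():
    ratings_loop[row.user_id - 1, row.item_id - 1] = row.rating

print(np.array_equal(ratings_vec, ratings_loop))
```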
Calculate the popularity: the number of times each item was rated
ratingsNum = ratingsNum.sum(axis=0)  # ratingsNum is now a 1-D vector of per-item rating counts
Calculate the total ratings obtained by each item
itemRateSum = ratings.sum(axis=0)
Calculate the average rating for each item
itemRateAvg = itemRateSum/ratingsNum
print(itemRateAvg)
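The per-item count, sum, and average can also be cross-checked directly from the long-format ratings table with a pandas groupby, without building the matrix. The sketch below uses a made-up five-row table (not MovieLens data). Items with no ratings simply do not appear in the result, so no division by zero can occur here.

```python
import pandas as pd

# Made-up ratings in the same long format as u.data.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'item_id': [1, 2, 1, 1, 3],
    'rating':  [5, 3, 4, 3, 2],
})

# One row per item: number of ratings, total rating, and average rating.
item_stats = toy.groupby('item_id')['rating'].agg(['count', 'sum', 'mean'])

print(item_stats)
```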
Load the Item Data
i_cols = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL',
          'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime',
          'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
          'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('u.item', sep='|', names=i_cols, encoding='latin-1')
items.head()
Top 5 Popular Movies
Specify the number of recommended items
top_n = 5
Specify the active user by row index (0-based, so 0 corresponds to user id 1)
activeUser = 0
Mask the ‘rated’ items for the active user
mask_activeUser = ratings[activeUser, :] > 0
Make a copy of the per-item rating counts (ratingsNum)
itemRateNumCurrent = ratingsNum.copy()
Exclude rated items from recommendation by setting their popularity to 0
itemRateNumCurrent[mask_activeUser] = 0
Sort Movies by their popularity
itemSortInd = itemRateNumCurrent.argsort()
Recommend the top-N ranked items
topItems = itemSortInd[::-1][:top_n]  # indices of the top_n highest-scoring items, best first
print(items.loc[topItems, ['movie id', 'movie title']])
Top 5 Highest-Rated Movies
mask_activeUser = ratings[activeUser, :] > 0
Make a copy of itemRateAvg
itemRateAvgCurrent = itemRateAvg.copy()
Exclude rated items from recommendation by setting their average rating to 0
itemRateAvgCurrent[mask_activeUser] = 0
Sort Movies by their average rating
itemSortInd = itemRateAvgCurrent.argsort()
Recommend the top-N ranked items
topItems = itemSortInd[::-1][:top_n]  # indices of the top_n highest-scoring items, best first
print(items.loc[topItems, ['movie id', 'movie title']])
Try changing the active user to a different user and generate both recommendation lists for that user. Then consider which method gives the better recommendations.
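To make it easier to compare users, the masking and ranking steps can be wrapped in one helper. The sketch below is a minimal version on a made-up rating matrix (not MovieLens data). The function name `recommend_top_n` is hypothetical, and it masks rated items with negative infinity rather than 0, which is a small variation on the code above so that a rated item can never outrank an unrated one.

```python
import numpy as np

def recommend_top_n(ratings, scores, user_idx, top_n=5):
    """Return indices of the top_n highest-scoring items that user_idx has not rated."""
    candidate_scores = scores.astype(float).copy()
    rated = ratings[user_idx] > 0
    candidate_scores[rated] = -np.inf          # rated items sort to the end
    order = np.argsort(-candidate_scores, kind="stable")
    n_unrated = int((~rated).sum())
    return order[:min(top_n, n_unrated)]       # never return a rated item

# Made-up rating matrix (rows: users, columns: items); 0 means "not rated".
ratings = np.array([
    [5, 3, 0, 0],
    [4, 0, 0, 5],
    [1, 2, 3, 5],
    [4, 0, 0, 0],
])
popularity = (ratings > 0).sum(axis=0)          # TopPop scores
average = ratings.sum(axis=0) / popularity      # MovieAvg scores (every toy item is rated)

for user in range(ratings.shape[0]):
    print(user,
          recommend_top_n(ratings, popularity, user, top_n=2),
          recommend_top_n(ratings, average, user, top_n=2))
```

For the last toy user the two methods return different lists, which is the kind of difference worth looking for when trying other users in the MovieLens data.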