Recommender Systems

Introduction

Recommender Systems in Python

A recommender system:

- Attempts to recommend products that a user will probably like.

- Bases its recommendations on the user’s preference profile, extracted from their consumption/rating history, including

- The content of consumed items;

- The explicit ratings given to those items.

- The product could be: movies from IMDb/Netflix, videos from YouTube, books from Amazon, music, etc.

How to do Recommendations?

- Although people’s tastes vary, they do follow patterns, which can be used to predict likes and dislikes.

– People tend to like things that are similar to other things they like.

– People tend to like things that similar people like.

- Recommendation is all about

– Estimating these patterns of taste, and

– Discovering new and desirable items that people didn’t already know about.

Non-personalized vs personalized

Non-personalized

– Recommended items are generated without considering a user’s rating/consumption history.

– TopPop (Top Popular): recommends items with the highest popularity (largest number of ratings).

– MovieAvg (Movie Average): recommends top-N items with the highest average rating.
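As a rough illustration of the two scores just described, here is a minimal sketch on a tiny made-up rating matrix (the name toy_ratings and its values are invented for this example, not taken from MovieLens). It counts the ratings per item for TopPop and averages the ratings per item for MovieAvg.

```python
import numpy as np

# Toy user-item rating matrix (made-up data): rows are users, columns are
# items, and 0 means "not rated".
toy_ratings = np.array([
    [5, 3, 0, 0],
    [4, 0, 0, 5],
    [4, 2, 0, 5],
    [3, 4, 1, 0],
])

rated = toy_ratings > 0

# TopPop score: how many users rated each item.
popularity = rated.sum(axis=0)

# MovieAvg score: mean of the ratings each item actually received.
average = toy_ratings.sum(axis=0) / popularity

print('popularity:', popularity)
print('average:   ', average)
print('best by TopPop:', popularity.argmax(), '| best by MovieAvg:', average.argmax())
```

On this toy data the two methods pick different items: the item rated by the most users is not the one with the highest average, because an item with only a few very high ratings can win on average alone.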

How to do Recommendations? - Prem George Recommender Systems

Personalized

– Recommended items are generated based on a user’s rating/consumption history.

– Collaborative Filtering

– Content-based Methods
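The page does not go on to implement personalized methods, so the following is only a hedged sketch of the idea behind collaborative filtering ("people tend to like things that similar people like"). It assumes cosine similarity between users' rating vectors as the similarity measure, which is one common choice and not something the text specifies. The data and the helper cosine_similarity are made up for illustration.

```python
import numpy as np

# Made-up ratings: rows are users, columns are items, 0 means "not rated".
toy_ratings = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 0, 1],   # user 1: tastes close to user 0
    [1, 0, 5, 4],   # user 2: tastes far from user 0
], dtype=float)

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (hypothetical helper)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_01 = cosine_similarity(toy_ratings[0], toy_ratings[1])
sim_02 = cosine_similarity(toy_ratings[0], toy_ratings[2])
print('user 0 vs user 1:', round(sim_01, 3))
print('user 0 vs user 2:', round(sim_02, 3))
```

A collaborative filter built on this would give more weight to user 1's ratings than to user 2's when predicting what user 0 might like.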

Let's see an example of recommendation with Python - Non-personalized Methods

Data From A Movie Recommender System

We use the data from https://grouplens.org/datasets/movielens/

MovieLens (100k)

– This data set consists of:

– 100,000 ratings (1-5) from 943 users on 1682 movies.

– Each user has rated at least 20 movies.

– Simple demographic info for the users (age, gender, occupation, zip)

• The data was collected

– through the MovieLens web site (movielens.umn.edu)

– during the seven-month period from September 19th, 1997 through April 22nd, 1998.

– This data has been cleaned up

– Users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set.

MovieLens - Ratings and User Info

• u.data

– The full u data set, 100000 ratings by 943 users on 1682 items.

– Each user has rated at least 20 movies.

– Users and items are numbered consecutively from 1.

– The data is randomly ordered.

– This is a tab separated list of

– user id | item id | rating | timestamp.

– The time stamps are unix seconds since 1/1/1970 UTC
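To make the u.data layout concrete, here is a small sketch that parses two made-up rows in that tab-separated format and converts the Unix timestamps into readable dates. The sample rows are invented for illustration; only the column order comes from the description above.

```python
import io
import pandas as pd

# Two made-up rows in the u.data layout: user id, item id, rating, timestamp.
sample = "1\t10\t4\t880000000\n2\t20\t5\t890000000\n"
names = ['user_id', 'item_id', 'rating', 'timestamp']
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', names=names)

# Interpret the timestamp column as seconds since 1970-01-01 UTC.
sample_df['rated_at'] = pd.to_datetime(sample_df['timestamp'], unit='s', utc=True)

print(sample_df[['user_id', 'item_id', 'rating', 'rated_at']])
```

The same pd.to_datetime call should work on the real file once it is loaded with the column names shown here.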

• u.info

– The number of users, items, and ratings in the u data set.

MovieLens - Items (Movies)

• u.item

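– Information about the items (movies); this is a pipe-separated ('|') list of

movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |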

– The last 19 fields are the genres, a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once.

– The movie ids are the ones used in the u.data data set.
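The 0/1 genre flags can be turned back into genre names with a short list comprehension. The sketch below assumes the fields are separated by '|' and uses one made-up row (a hypothetical "Example Movie (1990)" flagged as Comedy and Romance) rather than a real line from u.item.

```python
import io
import pandas as pd

i_cols = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL',
          'unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime',
          'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
          'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
genre_cols = i_cols[5:]   # the last 19 fields are the genre flags

# One made-up row: a hypothetical movie flagged as Comedy and Romance.
flags = ['1' if g in ('Comedy', 'Romance') else '0' for g in genre_cols]
row = '|'.join(['1', 'Example Movie (1990)', '01-Jan-1990', '', 'http://example.com/'] + flags)
sample_items = pd.read_csv(io.StringIO(row + '\n'), sep='|', names=i_cols)

# Collect the names of the genre columns whose flag is 1.
first = sample_items.iloc[0]
genres = [g for g in genre_cols if first[g] == 1]
print(first['movie title'], genres)
```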

MovieLens - Others

• u.genre

– A list of the genres.

• u.user

– Demographic information about the users;

– this is a pipe-separated ('|') list of

user id | age | gender | occupation | zip code

– The user ids are the ones used in the u.data data set.

• u.occupation

– A list of the occupations.

Steps

1. Load the data

– u.data: ratings

– u.item: movie information

2. Convert data into user-item rating matrix

3. Calculate the popularity of movies

4. Calculate the average rating of movies

5. Recommend the movies that the active user did not watch before, by using TopPop or MovieAvg.

6. Present the recommendations.
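The steps above can be sketched end to end as one small function. This is a minimal sketch rather than the code used later on this page: the function name recommend, its parameters, and the toy ratings are all made up for illustration, and it marks already-rated items with -inf instead of 0 so they can never be returned.

```python
import numpy as np
import pandas as pd

def recommend(df, n_users, n_items, active_user, top_n, method='TopPop'):
    """Return zero-based item indices for active_user (hypothetical helper)."""
    # Step 2: user-item rating matrix, 0 = not rated (ids start at 1).
    ratings = np.zeros((n_users, n_items))
    ratings[df['user_id'].to_numpy() - 1,
            df['item_id'].to_numpy() - 1] = df['rating'].to_numpy()

    # Step 3: popularity = number of ratings per item.
    popularity = (ratings > 0).sum(axis=0)
    # Step 4: average rating per item (left at 0 for items nobody rated).
    average = np.divide(ratings.sum(axis=0), popularity,
                        out=np.zeros(n_items), where=popularity > 0)

    scores = popularity.astype(float) if method == 'TopPop' else average.copy()
    # Step 5: never recommend what the active user already rated.
    scores[ratings[active_user] > 0] = -np.inf
    ranked = np.argsort(-scores, kind='stable')
    return [int(i) for i in ranked if np.isfinite(scores[i])][:top_n]

# Made-up ratings from 3 users on 4 items.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2, 3, 3],
    'item_id': [1, 2, 1, 2, 4, 2, 3],
    'rating':  [5, 3, 4, 2, 5, 4, 2],
})

# Step 6: present the recommendations for the user at index 2 (user id 3).
print('TopPop:  ', recommend(toy, 3, 4, active_user=2, top_n=2, method='TopPop'))
print('MovieAvg:', recommend(toy, 3, 4, active_user=2, top_n=2, method='MovieAvg'))
```

The returned values are zero-based column indices, so add 1 to get the item id used in the data files.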

Python Code

Load Packages

import numpy as np

import pandas as pd

Load the Rating Data

names = ['user_id', 'item_id', 'rating', 'timestamp']

df = pd.read_csv('u.data', sep='\t', names=names)

df.head()

Check the size of the data

n_users = df.user_id.unique().shape[0]

n_items = df.item_id.unique().shape[0]

print(str(n_users) + ' users')

print(str(n_items) + ' items')

Construct the user-item binary ‘rating’ matrix: ratingsNum

ratingsNum = np.zeros((n_users, n_items))

for row in df.itertuples():
    # row[1] is user_id and row[2] is item_id; both start at 1
    ratingsNum[row[1]-1, row[2]-1] = 1

print(ratingsNum)

itemRateNumCurrent = ratingsNum.sum(axis=0)

itemRateNumCurrent.sort()

Make a Plot of Popularity vs Sorted Items

import matplotlib.pyplot as plt

plt.plot(itemRateNumCurrent[::-1])

plt.xlabel('sorted items') # adds label to x axis

plt.ylabel('popularity') # adds label to y axis

plt.show()

Construct the user-item rating matrix: ratings

ratings = np.zeros((n_users, n_items))

Convert Data Into User-Item Rating Matrix

for row in df.itertuples():
    # row[3] is the rating given by user row[1] to item row[2]
    ratings[row[1]-1, row[2]-1] = row[3]

print(ratings)
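The row-by-row loop is easy to read but slow on larger data sets. As a hedged alternative, the same matrix can be filled in one step with NumPy integer-array indexing. The sketch below uses a small made-up DataFrame (toy) and checks that both approaches give the same result.

```python
import numpy as np
import pandas as pd

# Made-up ratings with the same column order as the ratings DataFrame.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'item_id': [1, 3, 2, 3],
    'rating':  [5, 2, 4, 1],
})
toy_n_users, toy_n_items = 3, 3

# Loop version, as in the text.
loop_matrix = np.zeros((toy_n_users, toy_n_items))
for row in toy.itertuples():
    loop_matrix[row[1] - 1, row[2] - 1] = row[3]

# Vectorized version: index with whole arrays of row and column positions.
fast_matrix = np.zeros((toy_n_users, toy_n_items))
fast_matrix[toy['user_id'].to_numpy() - 1,
            toy['item_id'].to_numpy() - 1] = toy['rating'].to_numpy()

print(np.array_equal(loop_matrix, fast_matrix))
```

This relies on each (user, item) pair appearing at most once. If a pair were repeated, only one of its ratings would be kept.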

Calculate the popularity: the number of times each item was rated

ratingsNum = ratingsNum.sum(axis=0)

Calculate the total ratings obtained by each item

itemRateSum = ratings.sum(axis=0)

Calculate the average rating for each item

itemRateAvg = itemRateSum/ratingsNum

print(itemRateAvg)
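If an item has never been rated, dividing its rating sum by a count of zero produces NaN and a runtime warning. A minimal sketch of one way to guard against that, using made-up totals (rate_sum and rate_num are illustrative names), is to divide only where the count is positive:

```python
import numpy as np

# Made-up per-item totals; the last item has never been rated.
rate_sum = np.array([12.0, 9.0, 0.0])
rate_num = np.array([3.0, 3.0, 0.0])

# Divide only where the count is positive; unrated items keep the default 0.
rate_avg = np.divide(rate_sum, rate_num,
                     out=np.zeros_like(rate_sum), where=rate_num > 0)
print(rate_avg)
```

Giving unrated items an average of 0 is a design choice made here for simplicity. It keeps them at the bottom of a MovieAvg ranking.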

Load the Item Data

i_cols = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL',
          'unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime',
          'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
          'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

items = pd.read_csv('u.item', sep='|', names=i_cols, encoding='latin-1')

items.head()

Top 5 Popular Movies

Specify the number of recommended items

top_n = 5

Specify the active user by zero-based row index (index 0 is user id 1)

activeUser = 0

Mask the ‘rated’ items for the active user

mask_activeUser = ratings[activeUser, :] > 0

Make a copy of the per-item rating counts (ratingsNum)

itemRateNumCurrent = ratingsNum.copy()

Exclude rated items from recommendation by setting their popularity to 0

itemRateNumCurrent[mask_activeUser] = 0

Sort Movies by their popularity

itemSortInd = itemRateNumCurrent.argsort()

Recommend the top-N ranked items

print('movie ID' + '\t movie title')

print(items['movie title'].iloc[itemSortInd[::-1][:top_n]])
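The printed Series is labelled by zero-based row position, not by the movie id from the data files. As a hedged sketch of one way to present the result, the example below selects the top rows by position and shows the movie id, the title, and the score together. The tables toy_items and toy_scores are made-up stand-ins for the items table and the masked popularity scores.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the items table and the masked popularity scores.
toy_items = pd.DataFrame({
    'movie id': [1, 2, 3, 4],
    'movie title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
})
toy_scores = np.array([0.0, 7.0, 3.0, 9.0])   # 0 marks an already-rated item
toy_top_n = 2

# Positions of the toy_top_n largest scores, best first.
best = toy_scores.argsort()[::-1][:toy_top_n]

# Select rows by position and attach the score that produced the ranking.
report = toy_items.iloc[best][['movie id', 'movie title']].copy()
report['popularity'] = toy_scores[best]
print(report.to_string(index=False))
```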

Recommendation of Top N Rated Items

mask_activeUser = ratings[activeUser, :] > 0

Make a copy of itemRateAvg

itemRateAvgCurrent = itemRateAvg.copy()

Exclude rated items from recommendation by setting their average rating to 0

itemRateAvgCurrent[mask_activeUser] = 0

Sort Movies by their average rating

itemSortInd = itemRateAvgCurrent.argsort()

Recommend the top-N ranked items

print('movie ID' + '\t movie title')

print(items['movie title'].iloc[itemSortInd[::-1][:top_n]])

Try changing the active user ID to a different user and generate the recommendation list for that user. Then consider which of the two methods, TopPop or MovieAvg, gives the better recommendations.
