Movies

Group Members

Xi Liu, Yupeng Su, Ye Wang, Cheng Zhu

URLs

Github Repo: https://github.com/uxroc/infoweb
Data Source: https://www.kaggle.com/tmdb/tmdb-movie-metadata

Overview

In the project, we have implemented

a content-based recommendation system
a search engine

based on about 5,000 movies from TheMovieDB (TMDB). The website also supports pagination and links to the TMDB movie pages.

How to Run

Dependencies

pip install nltk
pip install django
pip install scikit-learn
pip install numpy

Instruction

git clone https://github.com/uxroc/infoweb
Go to the website directory
Run python manage.py runserver
Open a browser and go to http://localhost:8000
Enjoy!

Demo

>>recommendation

The first recommendation request computes everything from scratch, so it is quite slow. Please be a little patient. :)
Later recommendation queries respond quickly, because the results are stored in files and do not need to be recomputed.

>>search engine

As with recommendation, we compute from scratch only for the first query and load results directly from files for later queries. The website is therefore slow only when responding to the first query.

Recommendation Implementation

The recommendation is content-based and runs in two stages: cosine similarity, then rating.

Cosine Similarity

feature vector: (overview, keyword, genre, actor, director, title). Overviews are tokenized and then stemmed with PorterStemmer. Keywords and genres are split into single words. The other features are used in their original form.
tfidf: computed with TfidfVectorizer from sklearn.feature_extraction.text
similarity: computed with cosine_similarity from sklearn.metrics.pairwise
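
The two steps above can be illustrated with a small dependency-free sketch. It is not the project's code: the project relies on TfidfVectorizer and cosine_similarity from sklearn, while this version reimplements the same idea (smoothed idf, L2-normalised vectors, dot product as cosine) with the standard library only. The function names and the sample texts are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times a smoothed inverse document frequency.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
        # L2-normalise so that a plain dot product equals the cosine.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def top_similar(docs, query_index, k=20):
    """Indices of the k documents most similar to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scores = [(cosine(vectors[query_index], v), i)
              for i, v in enumerate(vectors) if i != query_index]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]


# Made-up feature strings standing in for the concatenated movie features.
movies = [
    "space crew alien planet rescue",
    "alien invasion space battle",
    "romantic comedy wedding paris",
]
nearest = top_similar(movies, query_index=0, k=2)
```

Here the second text shares terms with the first and the third does not, so the second is ranked ahead of the third. With sklearn, the same result would come from fitting TfidfVectorizer on the texts and passing the resulting matrix to cosine_similarity.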

Rating

After retrieving the top 20 most similar movies to the queried movie, we sort the movies according to their ratings. Rating computation comprises:

tmdb-popularity: a tmdb popularity rating for movies. Reference: https://developers.themoviedb.org/3/getting-started/popularity
vote-average and years: in addition to tmdb-popularity, we emphasize vote-average, which is the mean of all ratings given to a movie by tmdb users, and years, which is the release year of a movie. More recent movies are generally preferred.
Combination: to combine the three measurements, we first convert the raw vote-average and years into probability densities with Gaussian functions. If the converted values are g_avg and g_yrs, the rating is calculated as g_avg * g_yrs * popularity^2. The calculation is a modification of the movie rating formulas by FabienDaniel.
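
As a rough sketch of the combination step, the formula g_avg * g_yrs * popularity^2 could be written as follows. The Gaussian means and standard deviations are not given in this report, so the values below are placeholders chosen only for illustration, as are the function names and the sample movies.

```python
import math


def gaussian(x, mean, std):
    """Gaussian probability density at x."""
    return (math.exp(-((x - mean) ** 2) / (2 * std ** 2))
            / (std * math.sqrt(2 * math.pi)))


def rating(vote_average, year, popularity,
           avg_mean=10.0, avg_std=2.0, year_mean=2017, year_std=20.0):
    """g_avg * g_yrs * popularity^2.

    The four Gaussian parameters are assumed values, not the project's.
    """
    g_avg = gaussian(vote_average, avg_mean, avg_std)
    g_yrs = gaussian(year, year_mean, year_std)
    return g_avg * g_yrs * popularity ** 2


# Made-up candidates standing in for the top similar movies.
candidates = [
    {"title": "Movie A", "vote_average": 7.8, "year": 2014, "popularity": 40.0},
    {"title": "Movie B", "vote_average": 6.1, "year": 1995, "popularity": 40.0},
    {"title": "Movie C", "vote_average": 7.8, "year": 2014, "popularity": 10.0},
]
ranked = sorted(
    candidates,
    key=lambda m: rating(m["vote_average"], m["year"], m["popularity"]),
    reverse=True,
)
```

Because popularity enters squared, doubling it multiplies the rating by four, while vote-average and release year only modulate the score through the two Gaussian factors.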

As noted above, computing from scratch is slow, but it happens only once. Since the movie data are static, we store the similarity and rating matrices in numpy files. For later queries, we read and return results directly from these files, which keeps the response time acceptable.
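
The compute-once-then-load pattern described here might look like the sketch below. It uses pickle from the standard library so that it runs without extra packages; the project itself stores numpy files, where np.save and np.load would take the place of the pickle calls. The helper name, file name, and toy matrix are hypothetical.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the cached result at `path`; compute and store it on first use."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []  # records how often the expensive step actually runs


def expensive_similarity():
    """Stand-in for the slow similarity computation."""
    calls.append(1)
    return [[1.0, 0.2], [0.2, 1.0]]


cache_path = os.path.join(tempfile.mkdtemp(), "similarity.pkl")
first = load_or_compute(cache_path, expensive_similarity)   # computes and saves
second = load_or_compute(cache_path, expensive_similarity)  # loads from file
```

This only works because the movie data are static: if the data changed, the cached file would have to be deleted or rebuilt.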

Search Engine Implementation

We implemented the movie search by calculating similarity scores between the query and each movie.

functions: the search engine supports two use cases: 1. the user can search by movie title. 2. if the user does not know the exact movie title, they can search by keywords, actor names, and director names.
feature vector: (overview, keyword, movie title, actor, director). Each feature in the feature vector is tokenized and then stemmed with PorterStemmer. Stop words are removed, because they have a significant influence on the score if left in.
weights: different weights are assigned to the attributes of a movie (weighted zone scoring). The basic idea is to give higher weights to more 'informative' features, that is, the features that best represent the movie. Because the data set does not include any labels, the model is hard to evaluate, so the weights were not learned by a machine learning model. They were chosen through experiments and human evaluation.
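
Weighted zone scoring can be sketched as below. The weights are placeholders, not the values used in the project, and the sketch skips stemming and stop-word removal to stay short. Each zone contributes its weight times the fraction of query terms it contains. All names and sample movies are invented for illustration.

```python
# Hypothetical weights: higher for zones assumed to be more informative.
ZONE_WEIGHTS = {
    "title": 0.4,
    "actor": 0.2,
    "director": 0.2,
    "keyword": 0.1,
    "overview": 0.1,
}


def tokens(text):
    """Lower-cased whitespace tokens as a set (no stemming in this sketch)."""
    return set(text.lower().split())


def zone_score(query, movie, weights=ZONE_WEIGHTS):
    """Sum over zones of weight * fraction of query terms found in the zone."""
    q = tokens(query)
    if not q:
        return 0.0
    score = 0.0
    for zone, weight in weights.items():
        zone_tokens = tokens(movie.get(zone, ""))
        score += weight * len(q & zone_tokens) / len(q)
    return score


def search(query, movies):
    """Movies sorted by descending zone score for the query."""
    return sorted(movies, key=lambda m: zone_score(query, m), reverse=True)


# Made-up movies for illustration.
movies = [
    {"title": "Quiet Valley", "actor": "Alex Poe", "director": "Sam Lee",
     "keyword": "farm", "overview": "a storm hits a quiet farm"},
    {"title": "Ocean Storm", "actor": "Jane Doe", "director": "John Roe",
     "keyword": "sea ship", "overview": "a ship is caught in a storm"},
]
results = search("storm", movies)
```

Both sample movies mention "storm" in the overview, but only one has it in the title, so the title weight decides the ranking.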

As with recommendation, computing from scratch is slow, but it happens only once. The score matrix is stored in numpy files. For later queries, results are returned directly from these files.
