In this project, we have implemented movie recommendation and movie search based on about 5,000 movies from TheMovieDB (TMDB). The website also supports pagination and redirects to the TMDB movie pages.
pip install nltk
pip install django
git clone https://github.com/uxroc/infoweb
Then, from the website directory, start the development server:
python manage.py runserver
Then open http://localhost:8000 in a browser.
We have implemented content-based recommendation in two stages: first we find the movies most similar to the queried movie by cosine similarity, then we sort those movies by rating.
Stemming uses PorterStemmer. Keywords and genres are split into single words; other features are used in their original form. Features are vectorized with TfidfVectorizer from sklearn.feature_extraction.text, and similarities are computed with cosine_similarity from sklearn.metrics.pairwise.
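The similarity stage can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the movie titles and feature strings are made-up toy data, the helper name most_similar is hypothetical, and stemming is left out to keep the example short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: one feature string per movie
# (for example, keywords and genres joined with spaces).
movies = {
    "Movie A": "space alien war action",
    "Movie B": "space alien rescue adventure",
    "Movie C": "romance wedding comedy",
}
titles = list(movies)

# Turn each feature string into a TF-IDF vector (one row per movie).
tfidf = TfidfVectorizer().fit_transform(list(movies.values()))

# Pairwise cosine similarity between all movies: shape (n_movies, n_movies).
sim = cosine_similarity(tfidf)


def most_similar(title, k=2):
    """Return the k titles most similar to `title`, excluding itself."""
    i = titles.index(title)
    order = np.argsort(-sim[i])  # indices sorted by descending similarity
    return [titles[j] for j in order if j != i][:k]


print(most_similar("Movie A", k=1))
```

In this toy corpus, "Movie A" and "Movie B" share two terms while "Movie C" shares none, so "Movie B" ranks first for "Movie A".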
After retrieving the 20 movies most similar to the queried movie, we sort them according to their ratings. Given the two factors g_avg and g_yrs, the rating is calculated as g_avg * g_yrs * popularity^2. This calculation is a modification of the movie rating formulas by FabienDaniel.

Computing the similarities and ratings from scratch is slow, but it happens only once. Since the movie data are static, we store the similarity and rating matrices in numpy files. For further queries, we read and return results directly from these files, which keeps the response speed acceptable.
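A rough sketch of the rating stage and the file cache is shown below. The formula is the one stated above; everything else is an assumption for illustration: the numeric values are invented, the helper names rating and load_or_compute are hypothetical, and the file name ratings.npy is not taken from the project.

```python
import os
import tempfile

import numpy as np


def rating(g_avg, g_yrs, popularity):
    """Rating formula from the text: g_avg * g_yrs * popularity^2."""
    return g_avg * g_yrs * popularity ** 2


def load_or_compute(path, compute):
    """Load the array at `path`; compute and save it on first use."""
    if os.path.exists(path):
        return np.load(path)
    result = compute()
    np.save(path, result)
    return result


# Hypothetical per-movie values (index = movie id).
g_avg = np.array([0.9, 0.5, 0.8])
g_yrs = np.array([1.0, 0.6, 0.5])
popularity = np.array([2.0, 3.0, 1.0])

# Hypothetical cache location; a temporary directory keeps the sketch tidy.
cache_path = os.path.join(tempfile.mkdtemp(), "ratings.npy")
ratings = load_or_compute(cache_path, lambda: rating(g_avg, g_yrs, popularity))

# Pretend these are the most similar movies from the first stage,
# then reorder them by descending rating.
candidates = np.array([2, 0, 1])
ranked = candidates[np.argsort(-ratings[candidates])]
print(ranked)
```

The second call to load_or_compute with the same path skips the computation and reads the saved array, which is the behaviour the text relies on for fast responses.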
We implemented movie search by calculating similarity scores between the query and each movie. Stemming uses PorterStemmer. Stop words are removed, because they would otherwise have a significant influence on the score.

As with recommendation, computing from scratch is slow, but it happens only once. We store the score matrix results in numpy files, and further queries return results directly from those files.
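The text does not say which scoring function the search uses, so the sketch below assumes TF-IDF vectors compared by cosine similarity, mirroring the recommendation stage. The overviews, titles, and the search helper are hypothetical, stop words come from scikit-learn's built-in English list, and stemming is omitted for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical movie texts to search over.
titles = ["Movie A", "Movie B", "Movie C"]
overviews = [
    "a pilot fights aliens in space",
    "a detective hunts a killer in the city",
    "two friends open a bakery in the city",
]

# stop_words="english" drops common words so they cannot affect the score.
vectorizer = TfidfVectorizer(stop_words="english")
movie_vectors = vectorizer.fit_transform(overviews)


def search(query, k=2):
    """Return up to k (title, score) pairs with a non-zero score."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, movie_vectors)[0]
    ranked = sorted(range(len(titles)), key=lambda i: scores[i], reverse=True)
    return [(titles[i], float(scores[i])) for i in ranked[:k] if scores[i] > 0]


print(search("the aliens in space"))
print(search("in the"))  # only stop words, so nothing matches
```

A query made up only of stop words produces an all-zero vector and therefore no results, which shows why leaving stop words in would let words like "in" and "the" match almost every movie.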