Report: Development Phase I:- Search

The Video Game Recommender System shows a list of the top 10 games based on user reviews. The Game Finder App uses Metacritic video game comments. To find the top 10 games, the app uses TF-IDF (term frequency-inverse document frequency) weights to measure the similarity between the user query and each game.

Background

In information retrieval, TF-IDF stands for term frequency-inverse document frequency. TF-IDF tells us how important a particular word is to a document within a collection. TF and IDF are calculated separately and then multiplied.
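For reference, one common formulation of these quantities is written out here. The report does not state which variant the app uses, so the symbols (f_{t,d} for the number of times term t occurs in document d, N for the number of documents, df_t for the number of documents that contain t) and the choice of logarithm are assumptions.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}_t}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

Under this definition a word that appears in every document gets an IDF of zero, so it contributes nothing to the similarity score.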

After finding TF-IDF for each document and the user query, we need to calculate cosine similarity. Cosine similarity shows how similar two vectors are by measuring the angle between them.
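As an illustration, cosine similarity between two sparse TF-IDF vectors could be computed roughly as follows. This is a minimal sketch, not the app's actual code: the class name CosineSketch and the choice to store each vector as a map from term to weight are assumptions.

```java
import java.util.Map;

class CosineSketch {
    // Cosine similarity = dot(a, b) / (|a| * |b|); returns 0 if either vector is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared terms contribute
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```

Two vectors that share no terms score 0, and two vectors pointing in the same direction score 1, regardless of their lengths.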

Procedure

First, we need to load the user comments and game titles. To load the data, we read each line and then split it to get the title and the user comments.
After loading the data, we remove stopwords and split the text into tokens.
Now we have our tokens. We use the tokens to find the TF-IDF values. We created two functions, one for TF and one for IDF, which return the TF and IDF scores.
We will use the following formula for the TF-IDF score. For every token t, we save the TF-IDF value.
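The steps above can be sketched in Java as follows. This is a hedged illustration rather than the app's code: the class and method names, the tiny stopword list, the length-normalized TF, and the natural-log IDF are all assumptions, since the report does not show the exact formula.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TfIdfSketch {
    // Tiny illustrative stopword list (an assumption); a real list is much longer.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "is", "it", "of", "to", "in", "this"));

    // Lowercases the text, splits on non-alphanumeric characters, drops stopwords.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOPWORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Term frequency: occurrences of the term divided by the document length.
    static double tf(List<String> doc, String term) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // Inverse document frequency: log(N / df), or 0 if no document contains the term.
    static double idf(List<List<String>> docs, String term) {
        long df = docs.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) docs.size() / df);
    }

    // TF-IDF weight for every distinct token of one document.
    static Map<String, Double> weights(List<String> doc, List<List<String>> docs) {
        Map<String, Double> result = new HashMap<>();
        for (String term : new HashSet<>(doc)) {
            result.put(term, tf(doc, term) * idf(docs, term));
        }
        return result;
    }
}
```

The same weights method can be applied to a tokenized user query, treating the query as one more token list scored against the document collection.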

Now we find the TF-IDF for the user query. We tokenize the user query and find its TF-IDF scores. We will use the formula below to find the TF-IDF score.

After getting all the TF-IDF scores, we have to find the cosine similarity scores. Finding the cosine similarity between the user query and every document is inefficient. For this reason, we create a posting list. Using the posting list, we can find the top 10 documents for each user query token. To find the cosine similarity, we create the following cosine similarity calculation function.
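The report does not include the posting list code, so the following is only a minimal sketch of the idea. It maps each term to the documents that contain it and scores only those documents, instead of comparing the query with every document. The class name PostingListSketch and its methods are hypothetical, and the sketch accumulates scores over the full list of each query token; trimming each list to its top 10 entries, as described above, is a further refinement that is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PostingListSketch {
    // term -> (document id -> TF-IDF weight of the term in that document)
    final Map<String, Map<Integer, Double>> postings = new HashMap<>();
    // document id -> Euclidean length of the document's TF-IDF vector
    final Map<Integer, Double> docNorms = new HashMap<>();

    // Adds one document's TF-IDF weights to the index.
    void addDocument(int docId, Map<String, Double> weights) {
        double sumSquares = 0.0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            postings.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                    .put(docId, e.getValue());
            sumSquares += e.getValue() * e.getValue();
        }
        docNorms.put(docId, Math.sqrt(sumSquares));
    }

    // Returns up to k document ids, best match first.
    List<Integer> search(Map<String, Double> query, int k) {
        // Accumulate dot products, touching only documents that share a term with the query.
        Map<Integer, Double> dots = new HashMap<>();
        for (Map.Entry<String, Double> q : query.entrySet()) {
            Map<Integer, Double> list = postings.get(q.getKey());
            if (list == null) {
                continue; // term not in any document
            }
            for (Map.Entry<Integer, Double> p : list.entrySet()) {
                dots.merge(p.getKey(), q.getValue() * p.getValue(), Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(dots.keySet());
        ranked.sort((a, b) -> Double.compare(score(dots, b), score(dots, a)));
        return ranked.subList(0, Math.min(k, ranked.size()));
    }

    // Dot product divided by the document norm. The query norm is the same for
    // every document, so leaving it out does not change the ranking.
    private double score(Map<Integer, Double> dots, int docId) {
        double norm = docNorms.get(docId);
        return norm == 0.0 ? 0.0 : dots.get(docId) / norm;
    }
}
```

Because documents that share no term with the query are never visited, the work per search depends on the length of the relevant posting lists rather than on the size of the whole collection.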

Challenging part

Building search was the most difficult phase of the Game Finder App. The most challenging thing I faced in this phase was creating the posting list. In the first version of the search feature, I computed the cosine similarity against every document, but searching took a lot of time: approximately 25 seconds. Later I decided to create the posting list, and the app's performance improved significantly. A search now takes less than 3 seconds. However, as I mentioned before, creating the posting list was extremely difficult. I read the skeleton given by Dr. Deokgun Park over and over.

Experiment

First, the Game Finder App was developed by computing all cosine similarities, but that takes a lot of time. Later, the Game Finder App was redeveloped using a posting list, which is much faster than the all-cosine-similarity version.
Stopwords: Game Finder shows no results when stopwords are searched.
Lemmatization & Stemming: Game Finder also uses both lemmatization and stemming. It shows the same results for "Funny Game" and "Funny Games".

References

For Phase I, Game Finder used two references.

1. https://github.com/AdnanOquaish/Cosine-similarity-Tf-Idf-/blob/master/DocumentParser.java

2. https://stackoverflow.com/questions/27685839/removing-stopwords-from-a-string-in-java

What I did differently?

The first reference describes the algorithm for cosine similarity and TF-IDF. Game Finder uses a similar algorithm, but the implementation is different. In other words, the reference was used to understand cosine similarity.

The second reference shows the steps that can be taken to remove stopwords. Game Finder uses an algorithm similar to the one described in this link.

Report: Development Phase II:- Classify

The classify feature classifies a user query based on training data. To classify, Game Finder uses a Multinomial Naive Bayes classifier.

Multinomial Naive Bayes Classifier:

The multinomial naive Bayes classifier is a very simple algorithm, but it is surprisingly fast. It is well suited to large data sets and supervised learning. It assumes that the terms in a document are independent of each other given the class. Let's discuss the multinomial naive Bayes algorithm step by step:

First, we load the training data into the program from the data set. The training data set contains the details and genre of each document.
We count the number of classes in the whole data set.
We perform stemming and lemmatization on the documents.
We tokenize each document. We also put each unique word in the bag of words.
Now we are ready to use the multinomial naive Bayes classifier. We begin by calculating the prior probability of each class. To find the prior probabilities, we count how many documents have a particular class and divide that number by the total number of documents.

Now we calculate the conditional probability of each token of the user query using the following formula. The formula uses smoothing.

Now we choose a class. To find the most likely genre of the user query, we calculate P(c|user query).
After finding all the probabilities, we select the top 3, which are shown in the output.
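The steps above can be sketched in Java as follows. The report does not show its smoothing formula, so this sketch assumes the add-one (Laplace) estimate from the Introduction to Information Retrieval chapter listed in the references, P(t|c) = (count of t in class c + 1) / (total tokens in class c + vocabulary size). The class name NaiveBayesSketch and its methods are hypothetical, and the scores are summed as logarithms to avoid floating-point underflow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NaiveBayesSketch {
    final Map<String, Integer> docsPerClass = new HashMap<>();
    final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    final Map<String, Integer> tokensPerClass = new HashMap<>();
    final Set<String> vocabulary = new HashSet<>();
    int totalDocs = 0;

    // Adds one tokenized training document with its genre label.
    void train(String genre, List<String> tokens) {
        totalDocs++;
        docsPerClass.merge(genre, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(genre, g -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            tokensPerClass.merge(genre, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    // log P(c) + sum of log P(t|c) over the query tokens; genre must have been seen in training.
    double logScore(String genre, List<String> query) {
        // Prior: documents in this class divided by the total number of documents.
        double score = Math.log((double) docsPerClass.get(genre) / totalDocs);
        Map<String, Integer> counts = termCounts.get(genre);
        int total = tokensPerClass.getOrDefault(genre, 0);
        for (String t : query) {
            if (!vocabulary.contains(t)) {
                continue; // tokens never seen in training are ignored
            }
            int count = counts.getOrDefault(t, 0);
            score += Math.log((count + 1.0) / (total + vocabulary.size())); // add-one smoothing
        }
        return score;
    }

    // Returns up to k genres, most likely first.
    List<String> topGenres(List<String> query, int k) {
        List<String> genres = new ArrayList<>(docsPerClass.keySet());
        genres.sort((a, b) -> Double.compare(logScore(b, query), logScore(a, query)));
        return genres.subList(0, Math.min(k, genres.size()));
    }
}
```

Calling topGenres with k = 3 on a tokenized query corresponds to the top 3 output described above.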

Challenging part

The most challenging part of building a classifier was understanding the algorithm. Once I understood the algorithm, I was able to build the classifier very quickly. Another challenge was building the classifier in Java, since there are not as many built-in libraries as in Python.

References

In the development phase II, I used two references and these are:

Chapter 13 from Introduction to Information Retrieval: https://nlp.stanford.edu/IR-book/pdf/13bayes.pdf
A video lecture of edureka! from YouTube: https://www.youtube.com/watch?v=psHrcSacU9Y

What I did differently?

I used the first link to understand the algorithm. Basically, I used the book to understand how the classifier works. I also brushed up my understanding using a lecture video from edureka!. I didn't use any code from other sources. I built the classifier based on the book chapter.

Report: Development Phase III:- Recommend

Recommending a small set of similar products from a large dataset is challenging. To overcome this challenge, we use a content-based recommender system. A content-based recommender system gives the most priority to user preference. The content-based filtering algorithm recommends similar products based on what the user likes.

Game Finder App uses a content-based recommender system. The content-based algorithm recommends games based on what the user likes by calculating cosine similarities. To use the recommend feature, users first need to search for games. After the search, Game Finder shows three games based on the user's query. Then, the user can find three more similar games by selecting one game from the search result list. In other words, the recommender system calculates cosine similarity two times.

Background:

TF-IDF: In information retrieval, TF-IDF stands for term frequency-inverse document frequency. TF-IDF shows us how important a particular word is. TF and IDF are calculated separately.

Cosine similarity: After finding TF-IDF for each document and the user query, we need to calculate cosine similarity. Cosine similarity shows how similar two vectors are.

Content-based filtering: Content-based filtering recommends similar products by considering user preference. In this case, the user preference is the game the user selects. After the user selects the desired game, the content-based algorithm shows three more similar games based on that selection.

Procedure:

The major steps of the content-based recommender system are below:

Finding 3 games based on the user query is very similar to the Search feature.

· Game Finder loads all documents from two .csv files and splits the data into title, platform, publisher, genre, players, release year, Metacritic rating, user rating, and user reviews.

· After loading the data, Game Finder removes stop words and tokenizes the text.

· Game Finder calculates the TF-IDF score for each token.

· After setting the TF-IDF for the training data, Game Finder takes the user query and performs stemming and lemmatization on it. Game Finder also tokenizes the user query and calculates the TF-IDF for each token.

· After finding the TF-IDF of the training data and the user query, Game Finder calculates the cosine similarity between each document and the user query. Game Finder shows the top 3 similar games based on the user query.

After finding the search results based on cosine similarities, the user can select one game by typing its number. Game Finder uses the selected game's description to find three similar games from the Metacritic dataset. Game Finder uses content-based filtering to show three similar games. The content-based filtering step works much like the search step: it also ranks games by cosine similarity. The steps Game Finder uses are below:

· Game Finder finds the TF-IDF of the training data by considering each game's title, publisher, platform, and genre.

· Game Finder also calculates the TF-IDF of the user-selected game's description. Game Finder uses the game's title, publisher, platform, and genre for the user-selected game.

· After finding the TF-IDF of the training data and the user-selected game, Game Finder calculates cosine similarities and shows three more similar games based on the user-selected game.
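The recommendation steps above can be sketched as follows. This is an illustrative sketch rather than the app's code: the class name RecommenderSketch, its methods, and the simple whitespace-style tokenization are assumptions, and the TF-IDF variant (length-normalized TF, natural-log IDF) is one common choice that the report does not confirm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class RecommenderSketch {
    // Joins the four metadata fields into one lowercase token list.
    static List<String> describe(String title, String publisher, String platform, String genre) {
        List<String> tokens = new ArrayList<>();
        String joined = String.join(" ", title, publisher, platform, genre);
        for (String t : joined.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Builds one sparse TF-IDF vector per game description.
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String t : new HashSet<>(doc)) {
                df.merge(t, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> v = new HashMap<>();
            for (String t : doc) {
                v.merge(t, 1.0 / doc.size(), Double::sum); // term frequency
            }
            for (Map.Entry<String, Double> e : v.entrySet()) {
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                e.setValue(e.getValue() * idf);
            }
            vectors.add(v);
        }
        return vectors;
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double x : b.values()) {
            nb += x * x;
        }
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    // Returns the indices of up to k games most similar to the selected one.
    static List<Integer> recommend(List<List<String>> docs, int selected, int k) {
        List<Map<String, Double>> vectors = tfIdf(docs);
        Map<String, Double> target = vectors.get(selected);
        List<Integer> others = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (i != selected) {
                others.add(i); // never recommend the selected game itself
            }
        }
        others.sort((a, b) -> Double.compare(
                cosine(vectors.get(b), target), cosine(vectors.get(a), target)));
        return others.subList(0, Math.min(k, others.size()));
    }
}
```

With k = 3, recommend returns the three games whose title, publisher, platform, and genre tokens overlap most with the selected game. Games that share a publisher and platform but not a genre still rank above games that share nothing.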

Challenging part:

Building the recommender system was easy compared to the other two features. However, I struggled at the very beginning. At first, I tried to build a collaborative filtering method, but I couldn't finish implementing it because of time. Then I switched to a different algorithm for the recommender system: content-based filtering. I also found visualizing the recommend feature difficult. To overcome this difficulty, I read chapter 9 of Mining of Massive Datasets (MMDS). I also read a blog article online. After understanding content-based filtering, I found it very easy to implement.

References:

To implement the recommender system, I used two sources, and these are:

· Chapter 9 from Mining of Massive Datasets (MMDS): http://infolab.stanford.edu/~ullman/mmds/ch9.pdf

· Online blog article: https://towardsdatascience.com/how-to-build-from-scratch-a-content-based-movie-recommender-with-natural-language-processing-25ad400eb243

What I did differently?

I used these two sources to understand the concept behind content-based filtering. However, I didn't use code from either source in the implementation. The blog post describes a movie recommendation example, which was very helpful for understanding content-based filtering. I developed this feature by reusing code from the search feature of Game Finder (Phase I).
