The Expectation-Maximization (EM) algorithm is an iterative method for maximum likelihood estimation in the presence of latent variables. It is well suited to estimating the parameters of a data distribution, such as a mixture model, when the group each observation belongs to is not observed.
Because the user dataset contains only implicit feedback (hours played rather than explicit ratings), we derive a rating system by applying the EM algorithm to the distribution of hours played for each game.
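To make the idea concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture, written with NumPy only. It is not the exact code behind our figures: the function name `fit_gmm_1d`, the quantile-based initialisation, and the synthetic `log_hours` sample are all assumptions made for illustration. On real data the number of components `k` would be 5, one per rating group.

```python
import numpy as np

def fit_gmm_1d(x, k=5, n_iter=200, tol=1e-6):
    """Fit a k-component 1-D Gaussian mixture to x with EM (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    # Assumed initialisation: means at evenly spaced quantiles, shared variance.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var() + 1e-6)
    weights = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation,
        # computed in log space for numerical stability.
        diff = x[:, None] - means[None, :]
        log_pdf = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances)
        log_joint = np.log(weights) + log_pdf
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        resp = np.exp(log_joint - log_norm[:, None])
        # M-step: re-estimate weights, means and variances from responsibilities.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / x.size
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
        # Stop once the log-likelihood (evaluated in the E-step) stops changing.
        ll = log_norm.sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, variances, resp

# Synthetic stand-in for one game's log hours (not real data): two player groups.
rng = np.random.default_rng(0)
log_hours = np.concatenate([rng.normal(0.0, 0.3, 200), rng.normal(3.0, 0.5, 300)])
weights, means, variances, resp = fit_gmm_1d(log_hours, k=2)
print("component means:", np.sort(means))
```

Each row of `resp` gives the probability that a user belongs to each group, which is what lets us talk about groups of players with similar habits.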
In the figure below, EM identifies 5 groups of players with similar gaming habits who would potentially rate the game The Witcher 3: Wild Hunt in a similar way.
As the graph shows, most of those who played The Witcher 3 stuck with it. However, a few users were not drawn in by the game and stopped playing after a few hours. The EM algorithm does a good job of finding groups of players with similar gaming habits, who would potentially rate the game in a similar way.
Where the distribution is denser, more users are concentrated; a dense mass among the groups with longer playtimes suggests that the majority of users are interested in the game.
We can see that a few users played Fallout 4 for only a few hours. It is possible that some of these users lost interest in the game shortly after starting it. The distribution is denser for groups 3 and 4, which shows that the majority of users are interested in this game, so a game like this would be highly rated.
A user-item matrix is created with the users being the rows and games being the columns. The missing values are set to zero. The observed values are the log hours for each observed user-game combination.
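As a rough sketch of this step, the snippet below builds such a matrix from a handful of made-up play records. The `records` list, the helper name `build_user_item_matrix`, and the choice of the natural logarithm are assumptions for illustration. The sketch also returns a boolean mask of observed entries, since a user with exactly 1 hour has a log value of 0 and would otherwise be indistinguishable from a missing value.

```python
import numpy as np

# Hypothetical play records: (user_id, game, hours played). Not real data.
records = [
    ("u1", "The Witcher 3", 120.0),
    ("u1", "Fallout 4", 3.5),
    ("u2", "Fallout 4", 40.0),
    ("u3", "The Witcher 3", 0.5),
]

def build_user_item_matrix(records):
    """Return (matrix, observed_mask, users, games) with users as rows, games as columns."""
    users = sorted({u for u, _, _ in records})
    games = sorted({g for _, g, _ in records})
    u_idx = {u: i for i, u in enumerate(users)}
    g_idx = {g: j for j, g in enumerate(games)}
    matrix = np.zeros((len(users), len(games)))       # missing values stay at zero
    observed = np.zeros_like(matrix, dtype=bool)      # True where a user played a game
    for u, g, hours in records:
        matrix[u_idx[u], g_idx[g]] = np.log(hours)    # natural log is an assumption
        observed[u_idx[u], g_idx[g]] = True
    return matrix, observed, users, games

matrix, observed, users, games = build_user_item_matrix(records)
print(games)
print(matrix)
```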
Matrix factorization takes a large matrix and factors it into smaller matrices whose product equals, or closely approximates, the original.
Singular value decomposition (SVD) is a matrix factorisation technique that reduces the number of features of a dataset by keeping only the leading components. Applying it gives us predicted values, but they differ significantly from the actual values.
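A minimal sketch of basic SVD prediction is shown below, assuming the zero-filled user-item matrix is decomposed with `numpy.linalg.svd` and rebuilt from its leading components. The helper names `svd_reconstruct` and `rmse`, and the small example matrix, are made up for illustration. Note that this approach treats the zeros as if they were real observations, which is one plausible reason the predictions drift away from the actual values.

```python
import numpy as np

def svd_reconstruct(matrix, n_components):
    """Rebuild the matrix from its n_components leading singular triplets."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    k = n_components
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def rmse(pred, actual, mask):
    """Root mean squared error over the entries selected by the boolean mask."""
    return float(np.sqrt(np.mean((pred[mask] - actual[mask]) ** 2)))

# Toy zero-filled matrix (not real data); zeros stand for missing entries.
M = np.array([[3.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 0.0, 3.0]])
mask = M != 0

approx = svd_reconstruct(M, n_components=1)
print("RMSE on observed entries with 1 component:", rmse(approx, M, mask))
```

Keeping all components reproduces the input exactly, so the number of components controls how much the matrix is compressed.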
SVD via gradient descent is an improvement over basic SVD. At each iteration we try to reduce the error calculated by the loss function (here, RMSE).
We set the learning rate to 0.001 and the number of iterations to 200.
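The sketch below shows one way such a factorization could be trained, using the learning rate of 0.001 and 200 iterations mentioned above. Everything else is an assumption for illustration: the function name `factorize_gd`, the number of latent factors, the random initialisation, and the absence of regularisation. The updates follow the gradient of the squared error on observed training entries only, which has the same minimiser as the RMSE, and the RMSE is recorded at each iteration for both the train and test entries.

```python
import numpy as np

def factorize_gd(matrix, train_mask, test_mask=None, n_factors=2,
                 lr=0.001, n_iter=200, seed=0):
    """Factor matrix ~ P @ Q.T by batch gradient descent on observed entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = matrix.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))   # user factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))   # game factors
    train_rmse, test_rmse = [], []
    for _ in range(n_iter):
        resid = matrix - P @ Q.T
        err = resid * train_mask          # zero out entries not used for training
        # Record the RMSE of the current factors, before this iteration's update.
        train_rmse.append(float(np.sqrt((err ** 2).sum() / train_mask.sum())))
        if test_mask is not None:
            test_rmse.append(float(np.sqrt((resid[test_mask] ** 2).mean())))
        # Gradient of 0.5 * sum(err**2) is -err @ Q for P and -err.T @ P for Q.
        P_new = P + lr * err @ Q
        Q = Q + lr * err.T @ P
        P = P_new
    return P, Q, train_rmse, test_rmse

# Toy rank-1 matrix, fully observed (not real data).
M = np.outer([1.0, 2.0, 3.0], [1.0, 2.0])
observed = np.ones_like(M, dtype=bool)
P, Q, train_rmse, _ = factorize_gd(M, observed, n_factors=1)
print("train RMSE, first and last iteration:", train_rmse[0], train_rmse[-1])
```

Passing a separate `test_mask` of held-out entries gives the second curve needed to compare train and test error over the iterations.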
The plot shows that, with SVD via gradient descent, the RMSE on the training set converges toward zero, while the RMSE on the test set stays at approximately 0.60.
We see that after roughly the 75th to 100th iteration, the accuracy on the test dataset stops improving (the RMSE remains around the same value). The accuracy on the test data could be improved by using more leading components (latent factors), the trade-off being more computation time.
The RMSE for basic SVD is greater than the RMSE for SVD using gradient descent, so predictions from SVD using gradient descent are more accurate.
With the predicted user-item matrix, we look again at the distribution of hours for both games, using the EM algorithm to find a reasonable 1-5 star rating.
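One possible way to turn a fitted mixture into star ratings is sketched below. It assumes that a fitted mixture supplies per-user responsibilities `resp` and component means `means` (both are hypothetical inputs here), ranks the components by mean log hours, and gives each user the rank of their most likely component. The mapping rule and the function name `stars_from_mixture` are assumptions for illustration, not necessarily the rule used to produce our figures.

```python
import numpy as np

def stars_from_mixture(resp, means):
    """Map each user to a 1..k star rating from mixture responsibilities.

    Components are ranked by mean (lowest mean -> 1 star), and each user
    receives the rank of the component with the highest responsibility.
    """
    resp = np.asarray(resp, dtype=float)
    means = np.asarray(means, dtype=float)
    order = np.argsort(means)                       # component ids, lowest mean first
    rank = np.empty(len(means), dtype=int)
    rank[order] = np.arange(1, len(means) + 1)      # rank[c] = stars for component c
    return rank[resp.argmax(axis=1)]

# Hypothetical example with 3 components and 3 users (not real data).
means = [2.0, 0.5, 3.0]
resp = [[0.90, 0.05, 0.05],
        [0.10, 0.80, 0.10],
        [0.00, 0.20, 0.80]]
stars = stars_from_mixture(resp, means)
print(stars)
```

With 5 components this yields ratings from 1 to 5, where the group with the longest predicted playtime receives 5 stars.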
Distributions 2-4 look like they fit fairly well. Distribution 5, on the other hand, is rather flat and only picks up the very end of the tail.
As we can see in the figure above, distributions 2-4 look like they fit the data fairly well. However, this is not the case for distribution 1. Distribution 5, on the other hand, is essentially flat on the right side of the figure.
We used the EM algorithm to derive ratings for games, and both basic SVD and SVD with gradient descent for matrix factorization, in order to generate recommendations for users.