Schafer, J.B., Frankowski, D., Herlocker, J. & Sen, S. (2007)
Today, most successful recommender systems use Collaborative Filtering (CF) to filter or evaluate items using the opinions of other people. The basic ingredients for building a CF system are: users, items, each user's list of items with associated opinions (ratings), a metric to measure similarity, a method to predict ratings for unrated items, and a metric to evaluate performance.
A CF system relies on past user behavior to produce predictions and recommendations. Ratings may be gathered through explicit means, implicit means, or both. Explicit ratings are those where a user is asked to provide an opinion on an item, whereas implicit ratings are inferred from a user's actions. CF algorithms can be separated into two classes: (i) memory-based, which requires all ratings, items, and users to be stored in memory; (ii) model-based, which periodically creates a summary of rating patterns offline. Because memory-based algorithms do not scale well for real-world applications, the majority of algorithms are model-based or a hybrid of the two. Either of these models can be non-probabilistic or probabilistic depending on the application, but non-probabilistic models are more widely used. Accuracy is the most prominent evaluation metric used to measure a system's predictions. The standard method is mean absolute error (MAE), which computes the average absolute difference between the predicted rating and the actual rating given by a user.
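As a minimal illustration of this evaluation step (my sketch, not from the paper; the sample ratings are invented), MAE over a held-out set of ratings can be computed as:

```python
import numpy as np

def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.abs(predicted - actual).mean()

# Hypothetical held-out ratings and the system's predictions for them.
actual = [4, 2, 5, 3, 1]
predicted = [3.8, 2.5, 4.6, 3.1, 1.9]
print(mean_absolute_error(predicted, actual))  # ~0.42
```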
User-based vs Item-based Nearest Neighbor Algorithms (non-probabilistic)
The most well-known CF algorithms are the nearest neighbor algorithms, which identify similarities between users or items. They fall into two classes: (i) user-based, which generates predictions for a user based on the ratings of similar users (i.e. neighbors); (ii) item-based, which generates predictions based on similarities between items. The user-based nearest neighbor algorithm captures how word-of-mouth recommendation works; however, the ratings data is often sparse, and calculating a user's perfect neighborhood is expensive (i.e. it requires comparison against all other users). In theory, the size of the model for the item-based nearest neighbor algorithm could be as large as the square of the number of items. In practice, the size can be substantially reduced by retaining only the top correlations for each item.
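A minimal sketch of a user-based nearest neighbor prediction (my illustration, not code from the paper), assuming a dense ratings matrix with 0 marking a missing rating and cosine similarity as the metric:

```python
import numpy as np

def predict_user_based(ratings, user, item, k=2):
    """Predict ratings[user, item] from the k most similar users who rated the item.

    ratings: (n_users, n_items) array, 0 = not rated.
    """
    target = ratings[user]
    sims = []
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue
        # Cosine similarity over the items both users have rated.
        mask = (target > 0) & (ratings[other] > 0)
        if not mask.any():
            continue
        a, b = target[mask], ratings[other][mask]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        sims.append((sim, ratings[other, item]))
    if not sims:
        return np.nan
    top = sorted(sims, reverse=True)[:k]                 # keep the k nearest neighbors
    weights = np.array([s for s, _ in top])
    neighbor_ratings = np.array([r for _, r in top])
    return weights @ neighbor_ratings / weights.sum()    # similarity-weighted average

# Toy 4-user x 3-item ratings matrix (0 = unrated).
R = np.array([[5, 3, 0],
              [4, 0, 4],
              [1, 1, 5],
              [5, 4, 2]])
print(predict_user_based(R, user=0, item=2))
```

An item-based variant works the same way, except the similarities are computed between item columns rather than user rows.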
Bayesian-network (probabilistic)
Probabilistic algorithms try to leverage the well-understood formalisms of probability to predict ratings or ranked recommendation lists. The most popular probabilistic framework involves Bayesian-network models that derive probabilistic dependencies among users or items. Probabilistic clustering and dimensionality reduction techniques have also been used to estimate latent classes. The advantage of probabilistic algorithms is that, in addition to computing the most probable rating, the likelihood of that rating being correct is known, thus capturing the algorithm's confidence.
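To make the confidence point concrete (my illustration, not from the paper), suppose a probabilistic model outputs a distribution over the five star ratings; the most probable rating and its probability can then be reported together:

```python
import numpy as np

# Hypothetical posterior over ratings 1..5 produced by some probabilistic model.
posterior = np.array([0.05, 0.10, 0.15, 0.45, 0.25])

predicted_rating = int(np.argmax(posterior)) + 1              # most probable rating
confidence = float(posterior[predicted_rating - 1])           # its probability
expected_rating = float(np.dot(posterior, np.arange(1, 6)))   # alternative point estimate

print(predicted_rating, confidence, round(expected_rating, 2))  # 4 0.45 3.75
```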
Prediction vs Recommendation
It is important to note that prediction and recommendation tasks place different requirements on a CF system. To recommend an item, the system does not necessarily need to know about all the items (i.e. a subset suffices), whereas to provide a prediction for a particular item, a system must store information about every item (including rarely rated ones). Hence, personalized predictions for many items often carry larger memory requirements.
In 2007 the authors concluded that CF would be one of the core systems driving personalization technologies. However, they acknowledged that filtering information as complex as aesthetic taste will continue to require people in the loop, in order to condense opinions into data that software can process more easily, namely ratings.
When the Netflix Prize was announced in 2006, offering $1 million to any team that could improve the accuracy of Netflix's existing system (Cinematch) by 10%, interest in Recommender Systems surged. Each year a progress prize of $50,000 was awarded to the top team, provided it improved accuracy by at least 1% over the previous progress prize winner (or Cinematch) and no team had yet won the grand prize. In 2009 the "BellKor's Pragmatic Chaos" team**, a merger of the teams "BellKor in BigChaos" and "Pragmatic Theory", won the grand prize with a final combination of 107 algorithms that achieved a 10.05% improvement over Cinematch (Amatriain, 2013).
The underlying algorithms with the best performance in the ensemble were Matrix Factorization and Restricted Boltzmann Machines; a linear combination of the two reduced the error to 0.88. Consequently, they were the only two algorithms put into production after some limitations were overcome, and they remain part of Netflix's recommendation engine today (Amatriain, 2013).
**The respective groups' papers on their winning algorithms can be found via the links: BellKor, BigChaos, Pragmatic Theory.
Koren, Y., Bell, R., & Volinsky, C. (2009)
Two primary areas of collaborative filtering (CF) are neighborhood methods and latent factor models. Neighborhood methods center on computing the relationships between items, or between users. In contrast, latent factor models try to explain the ratings by characterizing both items and users by a given number of factors inferred from the rating patterns. Factors measuring obvious dimensions could be comedy versus drama or the amount of action, less well-defined dimensions could be the depth of character development, and some dimensions could be completely uninterpretable (by humans).
The matrix factorization (MF) technique has been one of the most successful realizations of latent factor models. Figure 2 illustrates a simple two-dimensional example of the latent factor model: an item is measured by the extent to which it possesses each factor, and a user by the extent of their interest in items that are high on the corresponding factors. In this basic MF model, a user's predicted rating for an item is captured by the user's overall interest in the item's characteristics, or equivalently the dot product of the item's and user's vectors in that factor space.
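In the paper's notation, where each item i is associated with a factor vector q_i and each user u with a factor vector p_u, this basic prediction is simply the dot product:

```latex
\hat{r}_{ui} = q_i^{\top} p_u
```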
The major challenge for this model is computing the mapping of each item and user to its factor vector. Singular value decomposition (SVD) is a well-established technique for identifying latent semantic factors. However, SVD does not work well with a conventional CF system, which often has a high portion of missing values (sparseness in the ratings), and carelessly fitting only the relatively few known entries is highly prone to overfitting. During the Netflix Prize competition, "Funk's SVD", popularized by Simon Funk, resolved this by decomposing the original, very sparse matrix into two low-rank matrices that represent user factors and item factors. This is done with an iterative approach that minimizes a regularized loss function, and the most common learning algorithms for this are stochastic gradient descent (SGD) and alternating least squares (ALS). Stochastic gradient descent is generally easier and faster than ALS; however, ALS is more favourable when the system can use parallelization, and when the system is centered on implicit data.
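A minimal sketch of the SGD variant (my illustration, with assumed toy data and hyperparameters), minimizing the regularized squared error on the known ratings only:

```python
import numpy as np

def factorize_sgd(ratings, n_factors=2, lr=0.01, reg=0.02, epochs=200, seed=0):
    """Learn user and item factor matrices from (user, item, rating) triples via SGD."""
    rng = np.random.default_rng(seed)
    n_users = 1 + max(u for u, _, _ in ratings)
    n_items = 1 + max(i for _, i, _ in ratings)
    P = 0.1 * rng.standard_normal((n_users, n_factors))  # user factors
    Q = 0.1 * rng.standard_normal((n_items, n_factors))  # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            pu, qi = P[u].copy(), Q[i].copy()
            err = r - pu @ qi                        # error on a known rating only
            P[u] += lr * (err * qi - reg * pu)       # gradient steps with L2 regularization
            Q[i] += lr * (err * pu - reg * qi)
    return P, Q

# Toy (user, item, rating) triples; only observed entries drive the updates.
triples = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
P, Q = factorize_sgd(triples)
print(round(float(P[0] @ Q[2]), 2))  # predicted rating of user 0 for item 2
```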
Another challenge is to model biases accurately. Biases are the portion of the variation in rating values associated with a user or an item individually, independent of any user-item interaction, and capturing them is vital. In addition, temporal effects play an important role, as product perception and popularity are constantly changing while new selections emerge. Therefore, item biases, user biases, and user preferences should all reflect the dynamic nature of user-item interactions.
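Concretely, the paper extends the basic prediction with a global average μ and item and user biases b_i and b_u, and then lets the biases and user factors drift over time:

```latex
\hat{r}_{ui} = \mu + b_i + b_u + q_i^{\top} p_u
\qquad
\hat{r}_{ui}(t) = \mu + b_i(t) + b_u(t) + q_i^{\top} p_u(t)
```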
Figure 4 demonstrates the importance of including these influences and the number of parameters used to yield the optimum root mean square error (RMSE). In general, increasing the number of parameters improves accuracy, which is equivalent to increasing the factor model's dimensionality. It also clearly illustrates that a more complex factor model, such as one including temporal components, improves performance significantly.
In summary, the Netflix Prize competition showed that matrix factorization techniques were superior to the classic nearest-neighbor techniques on large datasets, and they have become a dominant methodology within collaborative filtering recommenders. Since matrix factorization is also a compact, memory-efficient model, it can easily integrate other crucial aspects, such as multiple forms of feedback, temporal dynamics, and confidence levels, into a single model.
Salakhutdinov, R., Mnih, A. & Hinton, G. (2007).
In 2007 the authors criticized existing collaborative filtering methods as incapable of handling very large datasets. Various Restricted Boltzmann Machine (RBM) models were evaluated in the study, and the selected models, applied to Netflix's dataset, achieved an error rate over 6% better than Netflix's own system.
An RBM is a two-layered neural network with one visible layer and one hidden layer. The visible units are the movies a user has rated, expressed as a K × m matrix (where K = 5 is the number of star ratings and m is the number of rated movies). They are "visible" because their states are observed (known), unlike the hidden units, which are user features whose states are yet to be determined. Each unit takes a binary (on/off) state, like a neuron in the brain, depending on its connected neighbors. Every visible unit is connected to all the hidden units and vice versa (each hidden unit is also connected to all the visible units), but there are no connections between units within a layer (no visible unit is connected to any other visible unit). The connections carry symmetric weights, and there are biases: one for each hidden unit and five (K = 5) for each visible unit. In addition, three adjustments were made to the model in order to apply it to the Netflix dataset.
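A minimal sketch (my illustration) of the parameter shapes this description implies, for K = 5 ratings, the full Netflix catalogue of m movies, and an assumed number F of hidden features:

```python
import numpy as np

K, m, F = 5, 17770, 100     # ratings, movies in the Netflix catalogue, hidden features (F is an assumption)

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((K, m, F))  # symmetric weights: one per (rating, movie, hidden feature)
b_visible = np.zeros((K, m))               # K biases per visible (movie) unit
b_hidden = np.zeros(F)                     # one bias per hidden unit

print(W.size + b_visible.size + b_hidden.size)  # total free parameters
```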
First, a common problem with collaborative filtering is sparse datasets (in Netflix's data, mostly missing ratings). The authors used a conditional multinomial distribution for each column of the "visible" matrix and a conditional Bernoulli distribution for modelling the "hidden" features, which ensures that movies with missing ratings make no contribution to the model's energy function. Secondly, the computation required for learning, if done exactly, would be very slow and have high variance. An approximate learning method called Contrastive Divergence (Hinton, 2002) was adopted to overcome this; it is much more efficient and greatly reduces the variance of the estimates used for learning. Lastly, making Kn evaluations for each user to predict ratings for n movies requires a large amount of computation, so accuracy was traded for speed and a single iteration of the mean-field equations was used in the experiments.
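A minimal sketch of one Contrastive Divergence (CD-1) step for a single user (my illustration; the shapes and helper functions are assumptions, and only the user's rated movies are touched):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cd1_step(V, rated, W, b_vis, b_hid, lr=0.01):
    """One CD-1 update for one user.

    V: (K, n) one-hot ratings for the n movies this user rated.
    rated: indices of those movies in the full catalogue.
    W: (K, m, F) weights; b_vis: (K, m); b_hid: (F,).
    """
    Wr, br = W[:, rated, :], b_vis[:, rated]          # restrict to this user's rated movies
    # Positive phase: hidden probabilities given the observed ratings.
    h_prob = sigmoid(b_hid + np.einsum('kn,knf->f', V, Wr))
    h_state = (np.random.rand(*h_prob.shape) < h_prob).astype(float)
    # Negative phase: reconstruct the softmax visible units, then re-infer the hiddens.
    V_recon = softmax(br + np.einsum('f,knf->kn', h_state, Wr), axis=0)
    h_recon = sigmoid(b_hid + np.einsum('kn,knf->f', V_recon, Wr))
    # Approximate gradient: data statistics minus reconstruction statistics.
    W[:, rated, :] += lr * (np.einsum('kn,f->knf', V, h_prob)
                            - np.einsum('kn,f->knf', V_recon, h_recon))
    b_vis[:, rated] += lr * (V - V_recon)
    b_hid += lr * (h_prob - h_recon)
```

Iterating such updates over every user's rating vector approximates maximum-likelihood learning far more cheaply than computing the exact gradient.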
From the baseline RBM model, three variants were developed and their performance tested on the validation dataset: (i) RBMs with Gaussian hidden units; (ii) conditional RBMs; (iii) conditional factored RBMs. A conditional RBM takes into account the extra information in the test set about a user, namely which movies they viewed, even when those ratings are unknown. The RBM models have a large number of free parameters, which makes learning problematic with large datasets, so this was addressed by factorizing the weights into a product of two lower-rank matrices.
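A rough sketch (my illustration) of the parameter saving such a factorization brings, using an assumed rank C and an assumed layout W ≈ A · B:

```python
K, m, F, C = 5, 17770, 100, 30   # ratings, movies, hidden features, factorization rank (C is an assumption)

full_params = K * m * F                 # full weight tensor
factored_params = K * m * C + C * F     # W approximated by A (K*m x C) times B (C x F)
print(full_params, factored_params)     # 8,885,000 vs 2,668,500
```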
Comparing these models, Figure 3 reveals that the conditional RBM significantly improves performance over the baseline RBM model, and that factorizing the parameters is crucial for converging faster on large datasets. In addition, a comparison with an SVD model was carried out, and the conditional factored RBM slightly outperformed SVD. However, the errors made by the RBM and SVD models differ significantly, and therefore several different versions of each method were linearly combined to produce the best predictions.
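As a final illustration (mine, not the paper's procedure), blending weights for such a linear combination can be fit on a held-out set by least squares over each model's predictions:

```python
import numpy as np

# Hypothetical held-out predictions from an RBM and an SVD model, plus the true ratings.
rbm_pred = np.array([3.9, 2.4, 4.7, 3.0, 1.8])
svd_pred = np.array([4.2, 2.1, 4.4, 3.3, 1.5])
actual = np.array([4.0, 2.0, 5.0, 3.0, 2.0])

X = np.column_stack([rbm_pred, svd_pred])
weights, *_ = np.linalg.lstsq(X, actual, rcond=None)  # least-squares blend weights
blend = X @ weights

rmse = lambda p: np.sqrt(np.mean((p - actual) ** 2))
print(weights, rmse(rbm_pred), rmse(svd_pred), rmse(blend))  # blend RMSE <= each model's on this set
```

Because the two models make different kinds of errors, the blend typically beats either model alone, which is exactly why the Netflix Prize winners combined many such predictors.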