A problem related to song recommendations is predicting ratings. For example, consider a set of movies that some number of people rate on a 1-5 scale. The term "sparse matrix" refers to this matrix of movies by raters, filled in with ratings, in which the vast majority of entries (roughly 99%) are empty. Most movie-user interactions are unknown, and the goal is to predict what those missing ratings would be. One way to do explicit rating prediction is to factorize this ratings matrix.
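To make the sparsity concrete, here is a minimal sketch of such a ratings matrix built with SciPy's sparse format, with illustrative sizes and a simulated ~1% of entries observed (all names and numbers here are made up for the example):

```python
import numpy as np
from scipy.sparse import csr_matrix

n_users, n_movies = 1000, 500
rng = np.random.default_rng(0)

# Simulate ~1% observed ratings (values 1-5); the other ~99% are missing,
# which the sparse format simply does not store.
n_ratings = int(0.01 * n_users * n_movies)
rows = rng.integers(0, n_users, n_ratings)
cols = rng.integers(0, n_movies, n_ratings)
vals = rng.integers(1, 6, n_ratings).astype(float)

ratings = csr_matrix((vals, (rows, cols)), shape=(n_users, n_movies))
density = ratings.nnz / (n_users * n_movies)
print(f"density ~ {density:.2%}")
```

Only the observed entries take memory; the factorization methods below try to fill in everything else.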
A ratings prediction algorithm learns latent factors in the matrix. One method for doing this is SVD, which stands for Singular Value Decomposition. This is akin to performing PCA on the users and the movies, and it is one way of performing matrix factorization on matrices with explicit ratings.
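As a rough sketch of the idea, the snippet below factors a tiny, fully observed toy matrix with NumPy's SVD and keeps only the top k latent factors, so each user and each movie gets a k-dimensional vector (a real explicit-ratings system must also handle the missing entries, typically with iterative methods; the matrix here is invented for illustration):

```python
import numpy as np

# Toy explicit ratings: 4 users x 3 movies, values 1-5.
R = np.array([[5, 3, 1],
              [4, 3, 1],
              [1, 1, 5],
              [1, 2, 4]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Truncate to k latent factors and reconstruct a low-rank approximation.
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_hat, 1))
```

The truncated reconstruction R_hat is the prediction: entries the user never rated get filled in by the low-rank structure.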
In our case, we have implicit ratings, meaning that users don't rate songs on any scale; instead, a user's preference is inferred from the presence of a song on a playlist. The loss function for Weighted Regularized Matrix Factorization (WRMF) is as follows:

min over x, y of: sum over (u, i) of c_ui * (p_ui - x_u^T y_i)^2 + lambda * (sum over u of ||x_u||^2 + sum over i of ||y_i||^2)
In the WRMF loss function, the preference matrix (p_ui) indicates users' preferences, that is, whether or not a song is in their playlist. We make the assumption that if a user has interacted at all with an item, then p_ui = 1; otherwise, p_ui = 0. The user vector is represented by x_u and the item vector is represented by y_i.
The confidence matrix is represented by c_ui, and it roughly describes how confident we are that user u does in fact have preference p_ui for item i. The last two terms are regularization terms so that we do not overfit. In using this loss function, we aim to minimize the confidence-weighted squared difference between each preference in our dataset and our prediction.
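The loss described above can be computed directly for a given set of factors. The sketch below assumes a common confidence choice, c_ui = 1 + alpha * r_ui (from Hu, Koren & Volinsky's implicit-feedback paper); the data, alpha, and lam values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 8, 10, 3

# Raw interaction counts; p_ui = 1 wherever the user interacted at all.
R = rng.poisson(0.3, size=(n_users, n_items)).astype(float)
P = (R > 0).astype(float)

# Confidence grows with the strength of the interaction.
alpha, lam = 40.0, 0.1
C = 1.0 + alpha * R

X = rng.normal(scale=0.1, size=(n_users, k))   # user vectors x_u
Y = rng.normal(scale=0.1, size=(n_items, k))   # item vectors y_i

def wrmf_loss(X, Y, C, P, lam):
    # Confidence-weighted squared error plus L2 regularization on both factors.
    err = P - X @ Y.T
    return np.sum(C * err**2) + lam * (np.sum(X**2) + np.sum(Y**2))

print(wrmf_loss(X, Y, C, P, lam))
```

Note that unlike the explicit-ratings case, every (u, i) cell contributes to the sum; the confidence weight, not omission, encodes how much we trust each entry.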
In our particular version of matrix factorization, we perform ALS minimization. This means that we hold one set of latent vectors constant, either the users' or the items', and then we take the derivative of the loss function with respect to the other set of vectors. We set the derivative equal to zero to search for a minimum and solve for the non-constant vectors. Then we hold these newly solved vectors constant and take the derivative of the loss function with respect to the previously constant vectors. We alternate back and forth until convergence, hence the name Alternating Least Squares minimization.
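The alternation above can be sketched compactly: with Y fixed, setting the derivative to zero gives the linear system (Y^T C_u Y + lam*I) x_u = Y^T C_u p_u for each user, and symmetrically for each item. This is a minimal, unoptimized illustration under the same assumed confidence scheme as before (production implementations vectorize these solves):

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, k, lam = 6, 8, 2, 0.1

R = rng.poisson(0.5, size=(n_users, n_items)).astype(float)
P = (R > 0).astype(float)          # binary preferences p_ui
C = 1.0 + 40.0 * R                 # confidences c_ui

X = rng.normal(scale=0.1, size=(n_users, k))  # user vectors
Y = rng.normal(scale=0.1, size=(n_items, k))  # item vectors

def loss(X, Y):
    err = P - X @ Y.T
    return np.sum(C * err**2) + lam * (np.sum(X**2) + np.sum(Y**2))

I = np.eye(k)
losses = []
for it in range(5):
    # Hold Y fixed: solve the normal equations for each user vector x_u.
    for u in range(n_users):
        Cu = np.diag(C[u])
        X[u] = np.linalg.solve(Y.T @ Cu @ Y + lam * I, Y.T @ Cu @ P[u])
    # Hold X fixed: symmetric update for each item vector y_i.
    for i in range(n_items):
        Ci = np.diag(C[:, i])
        Y[i] = np.linalg.solve(X.T @ Ci @ X + lam * I, X.T @ Ci @ P[:, i])
    losses.append(loss(X, Y))
    print(f"iter {it}: loss = {losses[-1]:.3f}")
```

Because each half-step solves its convex subproblem exactly, the loss is non-increasing across iterations, which is why convergence is well-behaved.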