Content recommendation has become a large part of everyday life. From Amazon recommending products to music streaming services suggesting new songs for our playlists, internet services are working hard to improve user experience and, of course, their revenue.
Music recommendation is a little tricky. The catalog of available music is enormous. People rarely listen to music purely for its own sake, so it is hard to gather explicit ratings for songs from users. Music is often used as background for other activities (e.g. reading, driving, sleeping) and is played in a sequence. People rarely stop what they are doing every 5-10 minutes to rate the songs they have just listened to. Thus, user feedback on a specific song is implicit. One indication of what kind of music a user likes is the history of songs and playlists the user has listened to.
The dataset we used is the Million Playlist Dataset from Spotify. As the name indicates, this dataset consists of a million playlists created by Spotify users. More information about this dataset is given in the EDA section.
Because this dataset is huge (more than 30GB after decompression) and our computational resources are limited, we decided to use only 10% of the data. We further split this subset into 80% for training and 20% for testing.
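The subsampling and split described above could be sketched as follows. This is a minimal illustration rather than the code used in the project: the function name `sample_and_split`, the fixed seed, and the integer ids standing in for real playlists are all assumptions.

```python
import random

def sample_and_split(playlist_ids, sample_frac=0.10, train_frac=0.80, seed=0):
    """Keep a random fraction of the playlists, then split them into train and test."""
    rng = random.Random(seed)
    n_sample = round(len(playlist_ids) * sample_frac)
    subset = rng.sample(playlist_ids, n_sample)  # sampling without replacement
    n_train = round(len(subset) * train_frac)
    return subset[:n_train], subset[n_train:]

# Hypothetical integer ids standing in for the real playlists.
all_ids = list(range(10_000))
train_ids, test_ids = sample_and_split(all_ids)
print(len(train_ids), len(test_ids))
```

Because `rng.sample` already returns the subset in random order, slicing it is enough to obtain a random train/test split.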
Before diving into the project, we reviewed the existing literature and took inspiration from it.
There are two prevailing approaches to recommending music, or anything else: collaborative filtering and content-based recommendation.
In collaborative filtering, the model takes advantage of listening history, e.g. the songs a user has listened to or the user's playlists, to recommend new songs.
Its underlying assumption is that people who listen to similar music have similar taste. (There is always someone who listens to music similar to yours.) Thus, to recommend songs to a new user, it finds the most similar existing users and recommends songs based on what these users listen to.
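The idea of recommending from the most similar users can be illustrated with a small sketch. The choice of Jaccard overlap as the similarity measure, the function names, and the toy listening histories are assumptions made for illustration; the text above does not prescribe a particular similarity.

```python
from collections import Counter

def jaccard(a, b):
    """Overlap between two sets of songs, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_from_neighbors(target, others, n_neighbors=2, n_songs=3):
    """Recommend songs that the most similar users have but the target lacks."""
    ranked = sorted(others, key=lambda u: jaccard(target, others[u]), reverse=True)
    votes = Counter()
    for user in ranked[:n_neighbors]:
        # Sorting keeps the vote order independent of set iteration order.
        votes.update(sorted(others[user] - target))
    return [song for song, _ in votes.most_common(n_songs)]

# Hypothetical listening histories.
histories = {
    "ann": {"s1", "s2", "s3", "s4"},
    "bob": {"s1", "s2", "s3", "s5"},
    "cat": {"s7", "s8"},
}
new_user = {"s1", "s2"}
recommended = recommend_from_neighbors(new_user, histories)
print(recommended)
```

Here "ann" and "bob" overlap with the new user while "cat" does not, so the song both neighbors share ("s3") collects the most votes.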
One of the most popular methods in collaborative filtering is matrix factorization, described by Koren, Bell and Volinsky [1] and popularized during the Netflix Prize competition.
In the context of music recommendation, the matrix factorization method starts by constructing an interaction matrix Y, where Y[a,i] indicates whether track i is in user a's playlist. If there are n users and m songs, Y has shape n × m. The matrix is extremely sparse: in the dataset we used, a playlist contains around 50 songs on average, whereas there are millions of distinct songs available.
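Because almost all entries of Y are zero, it is natural to store only the non-zero ones. The sketch below keeps, for each playlist, the set of track indices it contains and measures how dense the matrix is. The helper names and the toy playlists are hypothetical.

```python
def build_interactions(playlists):
    """Map each playlist (row a) to the set of track indices i with Y[a, i] = 1."""
    track_index = {}
    rows = []
    for tracks in playlists:
        row = set()
        for track in tracks:
            # Assign column indices in order of first appearance.
            row.add(track_index.setdefault(track, len(track_index)))
        rows.append(row)
    return rows, track_index

def density(rows, n_tracks):
    """Fraction of the n x m entries of Y that are non-zero."""
    return sum(len(r) for r in rows) / (len(rows) * n_tracks)

# Hypothetical toy playlists; track names stand in for track identifiers.
playlists = [["a", "b"], ["b", "c", "d"], ["e"]]
rows, track_index = build_interactions(playlists)
print(rows, density(rows, len(track_index)))
```

For a sense of scale, if there were two million distinct tracks (an illustrative figure, not a measurement), a playlist of 50 songs would fill 50 / 2,000,000 = 0.0025% of its row.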
The task here is to fill in the blanks of this sparse matrix, so that we know how a user would react to a new song and whether we should recommend it. This is done by a rank-k decomposition of the matrix into U (n × k) and V (k × m).
A row of U is the latent feature vector of a user, whereas a column of V is the latent feature vector of a song. The matrix product of U and V is X. The goal is to find U and V such that X is as close to Y as possible.
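The objective function of this task penalizes the difference between X and Y, for example with the regularized squared error used in [1],
which can be optimized using Alternating Least Squares or Stochastic Gradient Descent.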
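As an illustration of the factorization described above, the sketch below fits U and V with stochastic gradient descent. It assumes one common form of the objective, the squared error between Y and X = UV plus an L2 penalty on U and V with weight `reg`, and it treats every zero in Y as an observed negative. Both choices, along with the function names and hyperparameters, are assumptions for illustration; this is not the Spotlight implementation.

```python
import random

def factorize(Y, k=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit Y ~ U V by SGD on the regularized squared error over all entries."""
    rng = random.Random(seed)
    n, m = len(Y), len(Y[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]  # n x k
    V = [[rng.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(k)]  # k x m
    cells = [(a, i) for a in range(n) for i in range(m)]
    for _ in range(epochs):
        rng.shuffle(cells)
        for a, i in cells:
            pred = sum(U[a][f] * V[f][i] for f in range(k))
            err = Y[a][i] - pred
            for f in range(k):
                u, v = U[a][f], V[f][i]
                # Gradient step on 0.5 * err^2 + 0.5 * reg * (u^2 + v^2).
                U[a][f] += lr * (err * v - reg * u)
                V[f][i] += lr * (err * u - reg * v)
    return U, V

def predict(U, V, a, i):
    """Entry X[a, i] of the reconstructed matrix X = U V."""
    return sum(U[a][f] * V[f][i] for f in range(len(V)))

# Hypothetical 4 users x 4 songs interaction matrix with two taste groups.
Y = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
U, V = factorize(Y)
print(round(predict(U, V, 0, 0), 2), round(predict(U, V, 0, 2), 2))
```

On real data, the entries of X that are zero in Y but score highly are the candidates for recommendation.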
In this project, we first used a matrix factorization method developed by Maciej [cite], available in the Spotlight Python package, to build a recommendation system and made adjustments to it. We then developed a novel neural-network embedding model that can make song-to-song recommendations, with promising results.
A problem with directly applying user-based collaborative filtering is that we have millions of songs and playlists. If we represent each song as a dummy vector of the playlists it belongs to, and similarly each playlist as a dummy vector of the songs it contains, the matrix is too sparse and its dimension far too large to be processed efficiently.
A common solution is to reduce the dimension of the matrix. We use a neural network to learn an embedding of each song and playlist with respect to their "similarity". An embedding is a mapping from a categorical variable to a vector of continuous numbers that captures the similarity of entities in the context of our learning goal.
One of the most commonly used embeddings is one-hot encoding, which maps a categorical variable with k distinct values to a k-dimensional dummy vector. For example, if there are k playlists in total, then the i-th playlist is mapped to the i-th unit vector. However, in this projected space every point is at the same distance from every other point, so we cannot obtain much information about the relationships between the input points.
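The equal-distance property of one-hot vectors is easy to check numerically. The short sketch below (with hypothetical helper names) builds the k unit vectors and collects every pairwise Euclidean distance; only one distinct value appears, the square root of 2.

```python
import math
from itertools import combinations

def one_hot(i, k):
    """Return the i-th unit vector of length k (0-based index)."""
    return [1.0 if j == i else 0.0 for j in range(k)]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

k = 5
vectors = [one_hot(i, k) for i in range(k)]
# Rounding guards against floating-point noise when collecting distinct values.
distances = {round(euclidean(u, v), 12) for u, v in combinations(vectors, 2)}
print(distances)
```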
Here we want our embedding to represent the similarity of songs and playlists. We supply (song, playlist) pairs to the network and have it learn a label, designed such that songs that tend to belong to the same playlists are mapped close to one another, and similarly for playlists that tend to contain the same songs. We create separate embedding spaces for songs and playlists, and combine these embeddings via a second layer into a single number for our prediction task. For this supervised learning task, we assign the label 1 if the song is in the playlist and 0 otherwise.
After training our neural network, the embeddings can be read off directly from the resulting weights. Using these song and playlist embeddings, we can compute the nearest songs for each song, and similarly the nearest playlists for each playlist. Therefore, to recommend k extra songs for a playlist, we find the most similar playlists in the embedding space and return the k most frequent songs that appear in these playlists and are not already in our own playlist.
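A stripped-down version of this setup might look like the sketch below. Several details are assumptions made for illustration and are not stated in the text: negative (label 0) pairs are sampled uniformly from songs outside the playlist, the "second layer" is taken to be a dot product followed by a sigmoid, and training uses plain SGD on the log loss. All function names and hyperparameters are hypothetical.

```python
import math
import random

def make_training_pairs(playlists, n_songs, negatives_per_positive=1, seed=0):
    """Return (song, playlist, label) triples: 1 if the song is in the playlist, else 0."""
    rng = random.Random(seed)
    triples = []
    for p, songs in enumerate(playlists):
        members = set(songs)
        absent = [s for s in range(n_songs) if s not in members]
        for s in songs:
            triples.append((s, p, 1))
            # Assumption: negatives are drawn uniformly from songs not in the playlist.
            for neg in rng.sample(absent, min(negatives_per_positive, len(absent))):
                triples.append((neg, p, 0))
    rng.shuffle(triples)
    return triples

def score(song_vec, playlist_vec):
    """Combine two embeddings into one number: sigmoid of their dot product."""
    dot = sum(a * b for a, b in zip(song_vec, playlist_vec))
    return 1.0 / (1.0 + math.exp(-dot))

def train_embeddings(triples, n_songs, n_playlists, dim=4, lr=0.1, epochs=200, seed=0):
    """Learn song and playlist embeddings by SGD on the log loss of the labels."""
    rng = random.Random(seed)
    song_emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_songs)]
    list_emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_playlists)]
    for _ in range(epochs):
        for s, p, label in triples:
            grad = score(song_emb[s], list_emb[p]) - label  # d(log loss) / d(dot product)
            for d in range(dim):
                sv, pv = song_emb[s][d], list_emb[p][d]
                song_emb[s][d] -= lr * grad * pv
                list_emb[p][d] -= lr * grad * sv
    return song_emb, list_emb

# Hypothetical toy data: 2 playlists over 6 songs, identified by integer index.
playlists = [[0, 1], [1, 2, 3]]
triples = make_training_pairs(playlists, n_songs=6)
song_emb, list_emb = train_embeddings(triples, n_songs=6, n_playlists=2)
print(round(score(song_emb[0], list_emb[0]), 2))
```

A real model would be written in a deep learning framework and trained in mini-batches; the point here is only the shape of the data and of the prediction.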
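The recommendation step described above can be sketched as follows. Cosine similarity is assumed as the measure of closeness in the embedding space, and the two-dimensional embeddings, playlist names, and song ids are hypothetical stand-ins for trained weights and real data.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(target, playlist_emb, playlist_songs, k=2, n_neighbors=2):
    """Return the k most frequent songs in the nearest playlists that target lacks."""
    others = [p for p in playlist_emb if p != target]
    others.sort(key=lambda p: cosine(playlist_emb[target], playlist_emb[p]), reverse=True)
    owned = set(playlist_songs[target])
    counts = Counter()
    for p in others[:n_neighbors]:
        counts.update(s for s in playlist_songs[p] if s not in owned)
    return [song for song, _ in counts.most_common(k)]

# Hypothetical embeddings and playlist contents.
playlist_emb = {
    "road trip": [0.9, 0.1],
    "driving": [0.8, 0.2],
    "commute": [0.7, 0.3],
    "sleep": [-0.6, 0.8],
}
playlist_songs = {
    "road trip": ["s1", "s2"],
    "driving": ["s2", "s3", "s4"],
    "commute": ["s1", "s3", "s5"],
    "sleep": ["s6", "s7"],
}
suggestions = recommend("road trip", playlist_emb, playlist_songs)
print(suggestions)
```

With millions of playlists, the exhaustive sort over all other playlists would normally be replaced by an approximate nearest-neighbor index.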
As the name implies, a content-based recommendation system uses the content of the items explicitly. It is often difficult to come up with a perfect feature representation of these items, because people select music in very different ways. One reasonable approach is to hand-pick the features according to human habits. For music, the content could be the genre, artist, tempo, mood, or even the lyrics. Once we have these features in hand, we can design an algorithm that, for a given user, predicts how likely the user is to listen to a new song based on that song's content.
In [2], the authors propose a content-based music recommendation system based on a set of attributes derived from psychological studies of music preference. Audio analysis data is used to build a vector of 5 elements describing each track: Mellow, Unpretentious, Sophisticated, Intense and Contemporary (MUSIC). Combined with users' rating history on tracks, each user's preference for these 5 factors can be calculated. Given a user's taste, every track is ranked by its similarity to that taste, and the most similar tracks are recommended based on their content.
This approach has pros and cons. Unlike collaborative filtering, it can deal with newly released tracks that have few listening records, since all it needs is the attributes of the tracks. It also does not require data from other users, so it can work for each user independently. However, it does not handle new users well, since it needs each user's rating history. It may also recommend tracks that are too similar to the user's history, which can be predictable and boring.
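The ranking step can be illustrated with a small sketch. The exact computation in [2] is not reproduced here: the taste profile is assumed to be a rating-weighted average of the MUSIC vectors of the rated tracks, similarity is assumed to be cosine similarity, and the track vectors and ratings are made-up values.

```python
import math

FACTORS = ("Mellow", "Unpretentious", "Sophisticated", "Intense", "Contemporary")

def taste_profile(ratings, track_vectors):
    """Rating-weighted average of the MUSIC vectors of the tracks a user rated."""
    total = sum(ratings.values())  # assumes positive ratings
    profile = [0.0] * len(FACTORS)
    for track, rating in ratings.items():
        for d, value in enumerate(track_vectors[track]):
            profile[d] += rating * value / total
    return profile

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_tracks(profile, track_vectors, exclude=()):
    """Order unrated tracks from most to least similar to the user's taste."""
    candidates = [t for t in track_vectors if t not in exclude]
    return sorted(candidates, key=lambda t: cosine(profile, track_vectors[t]), reverse=True)

# Hypothetical MUSIC vectors and ratings (1 to 5).
track_vectors = {
    "lullaby": [0.9, 0.3, 0.4, 0.0, 0.1],
    "folk song": [0.6, 0.9, 0.2, 0.1, 0.1],
    "metal": [0.0, 0.1, 0.1, 0.9, 0.3],
    "ambient": [0.8, 0.2, 0.6, 0.1, 0.3],
    "punk": [0.1, 0.3, 0.0, 0.8, 0.4],
}
ratings = {"lullaby": 5, "metal": 1}
profile = taste_profile(ratings, track_vectors)
ranking = rank_tracks(profile, track_vectors, exclude=ratings)
print(ranking)
```

Since only the track's own attribute vector is needed, a track with no listening history can be ranked in exactly the same way as any other.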
[1] Yehuda Koren, Robert Bell, Chris Volinsky. Matrix factorization techniques for recommender systems. Computer 42.8 (2009).
[2] Mohammad Soleymani, Anna Aljanaki, Frans Wiering, Remco C. Veltkamp. Content-based music recommendation using underlying music preference structure. 2015 IEEE International Conference on Multimedia and Expo (ICME).
[3] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Wei Chen, Tie-Yan Liu. A Theoretical Analysis of Normalized Discounted Cumulative Gain (NDCG) Ranking Measures. In Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013).