Because there are so many users and playlists, and playlists are created and changed all the time, it is extremely time consuming and impractical to use the Last.fm API to fetch the latest tags for every song in order to recommend songs similar to a playlist. Besides, many unpopular and newly released songs have no reliable tags and so might never be recommended to any user. To handle these cases, we can build a pool of newly released songs (identified by release date from the Spotify API) and unpopular songs (identified by popularity from the Spotify API) and then use our model (an artificial neural network) to predict tags for them. With the predicted tags for the songs in this pool, we can compute the distance between each song and a playlist, and recommend the songs in the pool that lie closest to the playlist.
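As a rough illustration of how such a pool might be assembled, the sketch below splits tracks into a "new release" pool and an "unpopular" pool. The field names `name`, `release_date` and `popularity`, the 90-day window and the popularity cutoff of 20 are all assumptions made for this example, and release dates are assumed to be full `YYYY-MM-DD` strings.

```python
from datetime import date

def build_pools(tracks, today, new_within_days=90, max_popularity=20):
    """Split tracks into (new_releases, unpopular) lists of track names.

    The thresholds are hypothetical defaults, not values from the project.
    """
    new_releases, unpopular = [], []
    for track in tracks:
        released = date.fromisoformat(track["release_date"])
        if (today - released).days <= new_within_days:
            new_releases.append(track["name"])
        elif track["popularity"] <= max_popularity:
            unpopular.append(track["name"])
    return new_releases, unpopular

# Toy tracks with made-up values.
tracks = [
    {"name": "a", "release_date": "2019-04-01", "popularity": 5},
    {"name": "b", "release_date": "2015-01-01", "popularity": 10},
    {"name": "c", "release_date": "2015-01-01", "popularity": 80},
]
new_pool, unpopular_pool = build_pools(tracks, today=date(2019, 5, 1))
```

In this sketch a track that is both new and unpopular lands only in the new-release pool, which is one possible design choice among several.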
For the purposes of this project we assumed we were addressing the following problem. We are creating a song-recommender system for Spotify. We assumed the user has a playlist on Spotify that contains a variable number of songs, and it is our job to recommend new songs to go with that playlist. We have approached this problem in two ways:
Here our approach has been to generate a set of tags that describe qualities of the songs on the input playlist and then to use these tags in a k Nearest Neighbors (kNN) model to identify similar songs that we would recommend. As tags we used the most common tags assigned to songs in Last FM's Million Song Dataset, for example rock, pop, alternative, dance, love, and favorites. Specifically, we kept the tags that appeared at least 3000 times, which amounted to 274 tags in all. For a given playlist, one approach to our classification problem would then be to look up the tags of the playlist's songs in the Million Song Dataset and use some average of these tags in a kNN model to make recommendations. We found, however, that this approach was impractical, as the process of acquiring the tags was very slow. It was also limiting, since it does not work for less popular or newly released songs that have no tags in the Million Song Dataset. We therefore decided to take a two-stage approach. We knew that for any song on Spotify we would have rapid access to a standard set of audio features that Spotify compiles for each song, 13 audio features in all. Further, we guessed that these features would be good predictors of the Last FM tags of most songs, so we built neural networks to map Spotify audio features to Last FM tags. Then, we used the tags predicted by the network in a kNN model to make recommendations. A diagram of the general scheme is shown below.
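The tag-selection step can be sketched as follows: count how often each tag occurs across songs, keep the tags that meet a minimum count, and represent each song as a 0/1 vector over the kept tags. The function names and the toy data are illustrative, and the toy threshold of 2 stands in for the threshold of 3000 used in the project.

```python
from collections import Counter

def build_tag_vocabulary(song_tags, min_count):
    """Return the sorted tags that are attached to at least `min_count` songs."""
    counts = Counter(tag for tags in song_tags.values() for tag in set(tags))
    return sorted(tag for tag, n in counts.items() if n >= min_count)

def encode_tags(tags, vocabulary):
    """Encode a song's tags as a 0/1 vector aligned with `vocabulary`."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocabulary]

# Toy data: three songs and their tags.
song_tags = {
    "song_a": ["rock", "alternative", "love"],
    "song_b": ["rock", "pop"],
    "song_c": ["pop", "dance", "rock"],
}
vocabulary = build_tag_vocabulary(song_tags, min_count=2)
encoded = {song: encode_tags(tags, vocabulary) for song, tags in song_tags.items()}
```

With the full dataset the same encoding would give one 274-dimensional vector per song.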
Fig 1: Approach 1, Two-Stage Modeling Strategy
In attempting to construct a neural network that performs well, we built two types of networks, a convolutional neural network (CNN) and a more standard artificial neural network (ANN). Both approaches are described below. In the end we settled on the ANN as the best performer.
We used a CNN model to train on the features of playlists and to predict tags for each playlist. Later, in the kNN model, we would use these tags to make recommendations.
From the 100 playlists we randomly sampled from the Million Playlist Dataset, we selected those with more than 30 songs. For each of these playlists, we took the first 30 songs in order to keep a fixed input dimension for the CNN model. Next, we retrieved the 13 acoustic features for each song in each playlist. This left 66 playlists, each with 30 songs described by 13 acoustic features. The labels are the 274 tags for each song, using the same tags and the same tag-acquisition procedure as before.
Now, the input for the CNN has a dimension of (66, 30, 13, 1), and the output has a dimension of (274, 1).
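A minimal sketch of how such a fixed-size input could be assembled is given below, using plain nested lists so the shape logic is easy to follow. The helper names are hypothetical, and the feature values are placeholders rather than real Spotify data.

```python
def build_cnn_input(playlists, songs_per_playlist=30):
    """Keep playlists with more than `songs_per_playlist` songs, take their
    first `songs_per_playlist` songs, and add a trailing channel axis."""
    batch = []
    for songs in playlists:
        if len(songs) > songs_per_playlist:
            first = songs[:songs_per_playlist]
            batch.append([[[value] for value in features] for features in first])
    return batch

def shape(nested):
    """Return the dimensions of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(dims)

# Toy playlists of 45, 31 and 12 songs, each song with 13 placeholder features.
playlists = [[[0.0] * 13 for _ in range(n)] for n in (45, 31, 12)]
batch = build_cnn_input(playlists)
batch_shape = shape(batch)  # (playlists kept, songs, features, channels)
```

Only the two playlists with more than 30 songs survive the filter here, in the same way that 66 of the 100 sampled playlists did in the project.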
For the CNN model, after several trials and tests, we settled on the following structure.
Input Layer: (66, 30, 13, 1)
Conv2D Layer: kernel (3, 3), stride 1, no padding, ReLU activation => (32, 30, 12, 1)
Conv2D Layer: kernel (3, 3), stride 1, no padding, ReLU activation => (64, 30, 12, 1)
Maxpooling Layer: size (2, 2), stride 1 => (64, 15, 6, 1)
Dropout: 0.25
Flatten Layer: => (64 * 15 * 6, 1)
Dense Layer: ReLU activation, 128 nodes.
Dropout: 0.20
Output Layer: multiple Sigmoid activation, => (274, 1) where 274 is the number of tags (categories)
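Layer shapes like these can be sanity-checked with the standard output-size formula for unpadded convolution and pooling, (size - window) // stride + 1, applied per axis. The sketch below traces a 30 x 13 single-channel input through two 3 x 3 convolutions and a 2 x 2 pooling step. It assumes a pooling stride of 2 (the listing gives stride 1, but a halved output is what a stride of 2 produces), and the helper names are our own. Under these assumptions the numbers come out differently from the ones listed, so the listed shapes may follow a different convention, for example padded convolutions.

```python
def output_size(size, window, stride):
    """Output length along one axis for a window applied without padding."""
    return (size - window) // stride + 1

def trace(height, width, layers):
    """Return (name, height, width, channels) after each layer."""
    channels = 1
    result = []
    for name, window, stride, filters in layers:
        height = output_size(height, window, stride)
        width = output_size(width, window, stride)
        if filters is not None:  # convolutions set the channel count; pooling keeps it
            channels = filters
        result.append((name, height, width, channels))
    return result

layers = [
    ("conv1", 3, 1, 32),
    ("conv2", 3, 1, 64),
    ("pool", 2, 2, None),  # assumption: pooling stride equals the pool size
]
shapes = trace(30, 13, layers)
_, h, w, c = shapes[-1]
flattened = h * w * c  # number of inputs to the dense layer
```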
After training, the CNN model can predict the tags for each playlist. With the tags for the playlists, we would then recommend songs using a kNN model.
Of the 100 playlists we randomly sampled, 66 have more than 30 songs, and these were used for the CNN. We did a train-test split with 80% of the data for training and 20% for testing. The test accuracy of the CNN model averaged around 51.37% (it fluctuates across runs). The ANN described next has a much higher accuracy and performs better on the other metrics we measured. So after investigating both the CNN and ANN models, we decided to predict the tags with the ANN model.
Fig 2: model accuracy and model loss
A diagram of the ANN we used is shown below. It was built with Keras in TensorFlow. All layers were feedforward only and fully connected. Its input consisted of 13 Spotify audio features, averaged over all songs on the input playlist. A detailed description of these features is given on the "Data Acquisition" page. There were then three hidden layers with 20, 40, and 64 nodes respectively, followed by an output layer of 274 nodes representing the 274 Last FM tags. The output layer's activation function was a sigmoid, so that each output node calculated the probability that a given tag was highly represented among the tags of the input songs. These probabilities were converted to binary values (0 or 1) with a 0.5 threshold. For training, the true Last FM tags of the input songs were used: the 20 tags that appeared most often among the input songs' tags were given a value of 1 and the rest a value of 0. The network was trained using binary cross entropy as the loss function with the Adam optimization algorithm. The network was trained on 80 playlists and tested on 20 separate playlists. The network's accuracy in predicting the most relevant tags for a given playlist was found to be 94.18%. Thus, this approach successfully mapped Spotify audio features to Last FM tags.
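To make the input, target, loss and metric concrete, here is a small plain-Python sketch under stated assumptions: playlist features are a simple mean over songs, the target marks the k most frequent tags, and accuracy is counted element-wise over all outputs after thresholding. The function names and toy values are illustrative and are not taken from the project code.

```python
import math
from collections import Counter

def average_features(songs):
    """Average each audio feature over the songs of a playlist."""
    return [sum(column) / len(songs) for column in zip(*songs)]

def top_k_target(song_tags, vocabulary, k):
    """Mark the k most frequent tags among the playlist's songs with 1, the rest 0."""
    known = set(vocabulary)
    counts = Counter(tag for tags in song_tags for tag in tags if tag in known)
    top = {tag for tag, _ in counts.most_common(k)}
    return [1 if tag in top else 0 for tag in vocabulary]

def binary_cross_entropy(target, probs, eps=1e-7):
    """Mean binary cross entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(target, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(target)

def binary_accuracy(target, probs, threshold=0.5):
    """Fraction of outputs whose thresholded prediction matches the target."""
    preds = [1 if p > threshold else 0 for p in probs]
    return sum(int(a == b) for a, b in zip(target, preds)) / len(target)

vocabulary = ["dance", "love", "pop", "rock"]
playlist_tags = [["rock", "pop"], ["rock", "love"], ["rock", "pop", "dance"]]
features = average_features([[0.2, 0.8], [0.4, 0.6]])
target = top_k_target(playlist_tags, vocabulary, k=2)
probs = [0.1, 0.2, 0.7, 0.9]  # made-up network outputs
loss = binary_cross_entropy(target, probs)
accuracy = binary_accuracy(target, probs)
```

If accuracy is measured element-wise in this way (an assumption about how the reported figure was computed), a useful reference point is that predicting 0 for every tag already scores 254/274, about 92.7%, because only 20 of the 274 targets are 1.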
Fig 3: Our Artificial Neural Network
The next stage in modeling was to build a kNN model against which the predicted tags of a playlist could be compared. This was done by randomly selecting 100 playlists from the Million Playlist Dataset. Together these playlists contained 6803 songs. The Last FM tags for these songs were acquired from the Last FM API, and the songs were represented by their tags in 274-dimensional space. The value of each tag was normalized to range between 0 and 1. Then, the tags predicted by the ANN for a given playlist were placed in this space, and the nearest N songs, as judged by Euclidean distance, were chosen as recommendations to add to the input playlist. We imagine that N, the number of recommended songs, could be adjusted by the user. Further, we also built separate kNN models for newly released songs (kNN with 690 songs) and unpopular songs (kNN with 512 songs). The tags for these songs were not acquired directly from Last FM, as they do not exist, but were instead predicted by the ANN from Spotify audio features. With these three kNN models, we imagine that in response to a given input playlist the user could be returned several popular songs from the main kNN model as well as one or two newly released or less popular songs recommended by the other two specialty kNN models. Thus, the output can be tailored to the user's taste. For an example of a playlist and the songs recommended based on that playlist, see the Results / Interpretation 1 page.
On top of the model constructed above, we worked on developing another model based on the mood of a song, since the previous model uses the tags and audio features of the songs but does not take their mood into account when generating recommendations. For this approach, we use the tags of the songs and the lyrics of each song, extracted from Genius. We also introduce the concept of "mood", which is pre-defined as the 10 moods shown below.
These 10 moods were inspired by a literature review we did here. Based on the 10 moods described above, we also picked the tags from the Million Song Dataset that best match the 10 moods. You can see below for the specifics.
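The normalization and nearest-neighbor lookup described above can be sketched in a few lines of plain Python. This is a brute-force search over a toy three-tag catalog rather than the project's actual kNN model, and the function names and the `exclude` parameter are our own additions for illustration.

```python
import math

def min_max_normalize(rows):
    """Rescale every column (tag) of `rows` to the range 0..1."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def nearest_songs(query, catalog, n, exclude=()):
    """Return the n catalog songs closest to `query` by Euclidean distance."""
    candidates = [title for title in catalog if title not in exclude]
    candidates.sort(key=lambda title: math.dist(query, catalog[title]))
    return candidates[:n]

# Toy catalog: each song is a vector of already-normalized tag values.
catalog = {
    "song_a": [1.0, 0.0, 0.9],
    "song_b": [0.0, 1.0, 0.1],
    "song_c": [0.8, 0.2, 1.0],
    "song_d": [0.1, 0.9, 0.0],
}
query = [1, 0, 1]  # tags predicted for the input playlist
recommended = nearest_songs(query, catalog, n=2)
```

Passing the songs already on the playlist as `exclude` keeps them out of the recommendations, and `n` plays the role of the user-adjustable N.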
The associated tags picked from the Million Song Dataset:
Using more than 17,000 songs from 300 playlists sampled from the Million Playlist Dataset, we obtained tags for each song from the Million Song Dataset. We then compared each song's tags with the mood tags in Fig 4 and determined a weight for each mood, taking the mood with the maximum weight to represent that particular song.
Having defined one mood for each song, we then went to Genius Lyrics to obtain the lyrics for each song. After getting the lyrics, we dropped the songs that had 1) no lyrics or 2) no mood assigned. As a result, we ended up with 7,094 songs on which to try two separate approaches. We split these 7,094 songs into training and test sets with an 80:20 ratio.
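One way to read this weighting step is sketched below: each mood's weight is the sum of the weights of the song's tags that belong to that mood, and the song takes the mood with the largest weight. The mood names, the tag lists and the idea that each tag carries a numeric weight are assumptions made for the example, not the project's actual mapping.

```python
def assign_mood(song_tags, mood_tags):
    """Return the mood with the largest total tag weight, or None if no tag matches.

    song_tags: dict mapping tag -> weight for one song
    mood_tags: dict mapping mood -> list of tags associated with that mood
    """
    weights = {
        mood: sum(song_tags.get(tag, 0) for tag in tags)
        for mood, tags in mood_tags.items()
    }
    best = max(weights, key=weights.get)
    return best if weights[best] > 0 else None

# Hypothetical mood-to-tag mapping with only two moods for brevity.
mood_tags = {
    "happy": ["happy", "upbeat"],
    "sad": ["sad", "melancholy"],
}
mood_a = assign_mood({"upbeat": 60, "sad": 30, "rock": 100}, mood_tags)
mood_b = assign_mood({"rock": 100}, mood_tags)  # no mood tag present
```

Songs for which the function returns `None` correspond to the "no mood" songs that would be dropped from the dataset.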
To reference the code, please view here. (All models included)
The first approach we took was to use tf-idf. tf-idf stands for "term frequency-inverse document frequency". It is often used to identify the most important words in a set of texts, while down-weighting trivial words (e.g. "the", "it", "I") that appear frequently in almost every text.
After applying tf-idf to the lyrics of the 7,094 songs, we tried the following 3 methods to predict the mood of a song from the content of its lyrics. The details of each approach, along with its accuracy, confusion matrix, and precision-recall curve, are shown below:
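As a small worked example, the sketch below computes a textbook variant of tf-idf, with term frequency as count divided by document length and inverse document frequency as log(N / df). Libraries typically use smoothed or normalized variants, so the exact numbers would differ from what was used in the project. The toy "lyrics" are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

lyrics = [
    "the sun is shining",
    "the rain is falling",
    "the sun and the rain",
]
scores = tf_idf(lyrics)
```

Here "the" appears in every document, so its score is zero everywhere, while "shining", which appears in only one document, scores highest in the first one.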
Logistic Regression
Random Forest (tree = 1000, max depth = 25)
SVC
In addition to tf-idf, we also tried using word2vec. After applying word2vec to the collected lyrics, we tested 2 methods to predict the mood of a song from the content of its lyrics. The details of each approach, along with its accuracy, confusion matrix, and precision-recall curve, are shown below:
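A classifier such as logistic regression or a random forest needs one fixed-length vector per song. A common way to get one from word2vec, shown here as an assumption rather than as the method used in the project, is to average the vectors of the words in the lyrics. The tiny two-dimensional embedding table below is made up; in practice the vectors would come from a trained word2vec model.

```python
def lyric_vector(lyrics, embeddings, dim):
    """Average the embedding vectors of the known words in `lyrics`.

    Words missing from `embeddings` are skipped; if none are known,
    a zero vector of length `dim` is returned.
    """
    vectors = [embeddings[word] for word in lyrics.lower().split() if word in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical 2-dimensional word vectors.
embeddings = {
    "love": [1.0, 0.0],
    "cry": [0.0, 1.0],
    "heart": [0.5, 0.5],
}
vector = lyric_vector("Love heart unknown", embeddings, dim=2)
empty = lyric_vector("zzz", embeddings, dim=2)
```

The resulting vectors, one per song, would then serve as the feature matrix for the classifiers listed below.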
Logistic Regression
Random Forest (tree = 1000, max depth = 20)
Based on the results above, we chose word2vec with a random forest to make the recommendations.
In its final form, Approach 2 generates 20 recommended songs for one selected mood.
Sample results for both Approach 1 and Approach 2 are shown in the following tabs.
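How the 20 songs are chosen within a mood is not spelled out here, so the sketch below makes one simple assumption: it samples at random from the songs whose predicted mood matches the requested one. The function name, the mood labels and the song titles are hypothetical.

```python
import random

def recommend_by_mood(predicted_moods, mood, n=20, seed=None):
    """Pick up to n songs whose predicted mood equals `mood`.

    predicted_moods: dict mapping song title -> predicted mood label
    """
    candidates = sorted(song for song, m in predicted_moods.items() if m == mood)
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return rng.sample(candidates, min(n, len(candidates)))

# Toy predictions: 60 songs alternating between two made-up moods.
predicted_moods = {
    f"song_{i}": ("calm" if i % 2 == 0 else "angry") for i in range(60)
}
picks = recommend_by_mood(predicted_moods, "calm", n=20, seed=0)
```

A ranking by classifier confidence would be a natural alternative to random sampling if the model's class probabilities are available.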
Below is a summary of the approaches we took.
We finally settled on Approach 1-2 and Approach 2-2.