We set out to create a model that would help a Spotify user extend a playlist with relevant song recommendations, similar to what is already in that playlist. We took the Million Playlist Dataset and used subsets of it to fit baseline models (logistic regression, KNN, random forest, AdaBoost) that predict the playlist to which a song might belong, sorting the predicted probabilities from most to least likely. This approach naturally limits predictions to the data on which the model was trained. In other words, if the model was never trained on a particular playlist, it can never recommend songs for that playlist.
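The baseline ranking step can be sketched as follows. This is a minimal illustration with synthetic data, not our actual pipeline: the feature count, playlist count, and random values are all placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins for per-track audio features (acousticness, energy, ...)
# and the playlist label each training track belongs to.
X_train = rng.random((300, 5))        # 300 tracks, 5 audio features
y_train = rng.integers(0, 10, 300)    # 10 playlists seen during training

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# For a candidate song, rank the *training* playlists from most to
# least likely -- the model can only ever point at playlists it has seen.
candidate = rng.random((1, 5))
probs = model.predict_proba(candidate)[0]
ranking = np.argsort(probs)[::-1]     # playlist indices, most likely first
```

Note that `ranking` can only ever contain the 10 training playlists, which is exactly the limitation described above.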
To improve upon this, we decided to use clustering to group playlists sharing similar features. Our models can now handle new playlists by first identifying the cluster a playlist belongs to, then predicting songs using that cluster. This led to performance improvements on both the training and test data, which is encouraging. But what does that look like in practice?
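The cluster-based approach can be sketched like this, again with synthetic data. The use of k-means, the feature summaries, and the `recommend` helper are illustrative assumptions, not a description of our exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Each playlist summarized by (for example) the mean of its tracks'
# audio features; sizes here are placeholders.
playlist_features = rng.random((200, 5))   # 200 training playlists
kmeans = KMeans(n_clusters=8, n_init=10, random_state=1).fit(playlist_features)

# Pool of candidate tracks, each assigned to the cluster whose
# centroid its features fall nearest to.
track_features = rng.random((1000, 5))
track_clusters = kmeans.predict(track_features)

def recommend(new_playlist_feats, n=5):
    """Assign an unseen playlist to its nearest cluster, then surface
    tracks from that cluster (here simply the first n, for brevity)."""
    cluster = kmeans.predict(new_playlist_feats.reshape(1, -1))[0]
    candidates = np.flatnonzero(track_clusters == cluster)
    return candidates[:n]

recs = recommend(rng.random(5))
```

The key difference from the baseline is that an unseen playlist still gets a sensible answer: it is mapped to the nearest existing cluster rather than falling outside the model entirely.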
A Sample Case Study
We ran a small, admittedly preliminary pilot study in which we applied our 8 models (4 baseline, 4 cluster-based) to 4 of our own playlists, all created by the same author but for a variety of purposes. The author then rated each predicted song by how likely they would be to actually add it to the corresponding playlist.
So how did the models fare in practice?
Unsurprisingly, no single model was clearly superior in its predictions.
All models predicted the same songs for playlists 1 and 2, both relatively large and created for similar purposes, but containing songs with arguably different features, especially in terms of acousticness, instrumentalness, and speechiness. This is particularly interesting for the non-cluster-based models, though not unexpected for the cluster-based ones. Still, because the two playlists target different moods, the "best" model differed between them, though both AdaBoost models (raw and clustered data) performed well here. Recommendations for the remaining playlists were much more underwhelming. The recommendations for playlist 3 were particularly disappointing given the uniformly warm timbre the playlist was built around. We were curious what the models would recommend for playlist 4, a small set of bouncy tunes (low acousticness, high energy, high danceability); however, most of the suggested songs were much slower in tempo and featured strings and piano, with a few completely instrumental suggestions from the AdaBoost models.
Model Structure Assumptions
The crux of our cluster-based models is the assumption that cluster prediction accuracy is a proxy for recommendation relevance. The most immediate limitation of this assumption is that any given model has only as many sets of unique recommendations as there are clusters. In short, the more we simplify the problem (by lowering the number of clusters), the more we limit the generalizability of our model. We face a constant trade-off between model assumptions and model results.
The only way to relax this assumption without changing the structure of the model is to include more data, which leads to our next limitation: computation. All code was run on our laptops, which severely limits how much data we can include in the analysis. Several of the baseline models took a substantial amount of time to fit; the cluster-based models took less, though still non-trivial, time. In the future, access to a computing cluster might yield more representative clusters, better-fit models, and more relevant predictions.
We are also interested in other ways of clustering playlists. It may be worth manually clustering playlists and using those clusters to validate the data-driven ones, though such data collection is costly relative to the scope of this project. We also plan to look into other clustering algorithms, such as hierarchical and spectral clustering.
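Both alternatives are available in scikit-learn, so swapping them in would be straightforward. A minimal sketch, again on synthetic playlist summaries (the data and cluster count are placeholders):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, SpectralClustering

rng = np.random.default_rng(2)
playlist_features = rng.random((100, 5))   # illustrative playlist summaries

# Hierarchical (agglomerative) clustering merges playlists bottom-up,
# so the resulting tree could also be inspected for a natural cut point.
hier = AgglomerativeClustering(n_clusters=8).fit(playlist_features)

# Spectral clustering works on a similarity graph between playlists,
# which may capture non-convex groupings that k-means misses.
spec = SpectralClustering(n_clusters=8, random_state=2,
                          affinity="nearest_neighbors").fit(playlist_features)

hier_labels, spec_labels = hier.labels_, spec.labels_
```

One caveat: neither method has a native `predict` for unseen playlists, so assigning a new playlist to a cluster would need an extra step (e.g., nearest labeled neighbor).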
Features
Our models' performance may also suffer from a lack of predictive ability: we may simply lack relevant predictors. For instance, song popularity and release date may well influence a track's relevance to a given playlist. Overall, the track features from Spotify are reasonably representative (covering tracks of all genres and so forth); however, we may still be missing key features.
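Adding such predictors would amount to appending extra columns to the feature matrix. A hypothetical sketch (the popularity values and release years below are made up; Spotify's popularity score ranges 0-100):

```python
import numpy as np

rng = np.random.default_rng(3)

# Existing per-track audio features (acousticness, energy, ...).
audio_features = rng.random((4, 5))

# Hypothetical extra predictors, scaled to roughly [0, 1] so they sit
# on a comparable range to the audio features.
popularity = np.array([80, 35, 60, 12]) / 100.0
release_year = (np.array([2019, 1987, 2004, 2023]) - 1950) / 100.0

# Each track now carries 7 predictors instead of 5.
X = np.column_stack([audio_features, popularity, release_year])
```

The same baseline and cluster-based models could then be refit on the augmented matrix to test whether these features actually improve relevance.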