Overview
For our project, we sought to create a model that helps Spotify users extend their playlists by recommending songs with similar features. We queried the Spotify API for the audio feature attributes of each of our songs (danceability, speechiness, tempo, etc.). We began by creating four baseline models: a logistic regression model, a KNN model, a random forest model, and an AdaBoost model. Besides producing low train and test accuracies, these baseline models had two disadvantages: they could only recommend playlists that they were trained on, and they had to handle a very large number of classes, since each playlist id was a class. Our solution to these problems was to turn to clustering. We clustered our playlists based on commonalities between the feature attributes of each playlist. We applied two iterative methods to determine that we would use 34 clusters to group our playlists. Using clustering, we took the following steps: 1) Given a playlist, we determined which cluster it belongs to. 2) We tuned our baseline classification models and applied them, along with a neural network, to determine the probability that a song belongs in that cluster. 3) We found the song with the highest probability and selected it as the best song to recommend for that cluster. We tested our model on our own playlists and found that it returned songs that we enjoyed.
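The three steps above can be sketched roughly as follows. This is a minimal illustration and not our actual code: it assumes scikit-learn, uses synthetic data in place of real audio features, uses K-means and a random forest as stand-ins for the clustering method and the tuned classifiers, and uses 3 clusters instead of 34. The function name recommend and all variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
N_FEATURES = 13      # audio features per song, as in the report
N_CLUSTERS = 3       # toy value for the sketch; the report used 34
SONGS_PER_PLAYLIST = 20

# Synthetic stand-in data: 60 playlists whose songs scatter around 3 centers.
centers = rng.normal(scale=3.0, size=(N_CLUSTERS, N_FEATURES))
playlist_songs = [
    centers[i % N_CLUSTERS] + rng.normal(size=(SONGS_PER_PLAYLIST, N_FEATURES))
    for i in range(60)
]
X_songs = np.vstack(playlist_songs)
scaler = StandardScaler().fit(X_songs)

# Cluster playlists, each represented by the mean of its songs' features.
playlist_means = np.vstack([songs.mean(axis=0) for songs in playlist_songs])
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0)
playlist_clusters = kmeans.fit_predict(scaler.transform(playlist_means))

# Train a classifier mapping a song to the cluster of its source playlist.
y_songs = np.repeat(playlist_clusters, SONGS_PER_PLAYLIST)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(scaler.transform(X_songs), y_songs)


def recommend(playlist, candidates):
    """Return (candidate index, cluster, probability) for the best candidate."""
    # Step 1: find the cluster of the given playlist.
    mean_vector = playlist.mean(axis=0, keepdims=True)
    cluster = int(kmeans.predict(scaler.transform(mean_vector))[0])
    # Step 2: probability that each candidate song belongs to that cluster.
    proba = clf.predict_proba(scaler.transform(candidates))
    column = list(clf.classes_).index(cluster)
    # Step 3: pick the candidate with the highest probability.
    best = int(np.argmax(proba[:, column]))
    return best, cluster, float(proba[best, column])


# Usage: a new playlist near center 1 and one candidate song near each center.
new_playlist = centers[1] + rng.normal(size=(15, N_FEATURES))
candidates = centers + rng.normal(scale=0.1, size=centers.shape)
best, cluster, prob = recommend(new_playlist, candidates)
print(f"recommend candidate {best} for cluster {cluster} (p={prob:.2f})")
```

Because the classifier predicts clusters and not playlist ids, the same function works for a playlist that was never seen during training, which is the limitation of the baseline models that clustering was meant to address.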
Project Statement and Motivation
In our initial proposal, we expressed interest in improving Spotify’s daily playlist generation method by weighing a user’s general preferences more heavily than their recently played songs. However, we realized that we do not have information about the dates and times at which a user plays certain songs. Therefore, we revised our project to instead create a model that extends Spotify’s playlists by recommending new songs - a model that mimics Spotify's automatic playlist generation system.
As Spotify users, we were all interested in how Spotify understands our habits of listening to certain songs, artists, and playlists, and in how it succeeds in recommending songs that we enjoy. Aside from these personal interests, we also took note of the business incentives for Spotify to create a good algorithm for recommending songs. Namely, as the number of music recommender systems (MRS) on the market increases, it becomes increasingly important for Spotify to make successful recommendations in order to maintain its audience and improve user experience. Therefore, we became interested in two questions: How does Spotify recommend songs we like? How can we help them improve their methods? To answer these questions, we looked at the Million Playlist Dataset (MPD) and Spotify's public API on audio features, which characterizes songs according to 13 variables. Preliminary EDA revealed that different playlists do tend to have distinct values for these 13 variables, which justified our approach of using the audio features of songs in a playlist to predict whether a new song ought to be added to the playlist. Our end goal was to build models using the MPD and data from the Spotify API that recommend songs that fit our taste, just as Spotify successfully does for us.
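The playlist-level comparison described above amounts to averaging each audio feature over the songs in a playlist and then comparing those averages across playlists. A minimal sketch using only the Python standard library is shown here. The rows, the pid values, and the function name playlist_profiles are hypothetical, and only three of the 13 features are included for brevity.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: one dict per (playlist, track) pair with audio features.
tracks = [
    {"pid": 0, "danceability": 0.80, "energy": 0.75, "tempo": 120.0},
    {"pid": 0, "danceability": 0.60, "energy": 0.25, "tempo": 124.0},
    {"pid": 1, "danceability": 0.30, "energy": 0.50, "tempo": 80.0},
    {"pid": 1, "danceability": 0.20, "energy": 0.25, "tempo": 90.0},
]
FEATURES = ["danceability", "energy", "tempo"]


def playlist_profiles(rows, features):
    """Return {playlist id: {feature: mean value over the playlist's tracks}}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["pid"]].append(row)
    return {
        pid: {name: mean(r[name] for r in members) for name in features}
        for pid, members in grouped.items()
    }


profiles = playlist_profiles(tracks, FEATURES)
for pid, profile in sorted(profiles.items()):
    print(pid, profile)
```

If playlists have clearly different profiles, as our EDA suggested, then these per-playlist averages are a reasonable input for grouping playlists and for judging whether a new song fits.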
Description of Data
We worked with the Million Playlist Dataset (MPD), which is a dataset of 1 million playlists. While 1 million playlists seems like a very large number, it is only a small fraction of the roughly 2 billion public playlists on Spotify from which the MPD was sampled. The MPD "consist[s] of over 2 million unique tracks by nearly 300,000 artists." The playlists in this sample were created between January 2010 and November 2017. The MPD includes information such as the playlist id, position in playlist, track information, and artist information for each song in the dataset. Below is the full list of variables for each song:
Spotify notes that its users create playlists for a myriad of purposes, including parties, dinners with friends, road trips, or study sessions. The playlists that users create are also categorized in unique ways, such as by genre, artist, mood, culture, or occasion. Based on the user data that Spotify collects, there are also playlists such as "On Repeat," "Your Top Songs," or the Daily Mixes that are tailored to individual users. The diversity of considerations that go into creating Spotify playlists made us interested in learning about the characteristics of each of these playlists and in understanding how Spotify can keep users curious and happy by suggesting new songs that match user preferences.
Literature/Related Work
An interesting aspect of taking on this project was that we were able to reference the 2018 RecSys Challenge, a competition in which participants submitted systems that completed Spotify's task of automatic playlist continuation. Our approach differed from the challenge: participants submitted a list of recommended tracks that extend a given Spotify playlist, whereas we approached the same objective from a different angle; given a song, we determined which playlist the song ought to belong to. Nevertheless, we found it helpful to read through some of the code submitted by the winning participants to get an idea of the types of models they found useful, as well as to understand the more technical aspects of contacting the Spotify API.
1. https://recsys-challenge.spotify.com/overview
2. https://recsys-challenge.spotify.com/static/final_main_leaderboard.html
3. https://github.com/proto-n/recsys-challenge-2018/blob/master/4_0t.py
4. https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/
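As an illustration of the technical side of contacting the Spotify API, the sketch below batches track ids and requests their audio features from the endpoint documented in the last link above. It is a sketch under stated assumptions and not our actual code: it assumes a valid OAuth access token is already available, that the endpoint accepts up to 100 ids per request, and that access is permitted under Spotify's current API terms. The helper names chunked, build_request, and fetch_audio_features are hypothetical.

```python
import json
import urllib.parse
import urllib.request

AUDIO_FEATURES_URL = "https://api.spotify.com/v1/audio-features"
MAX_IDS_PER_REQUEST = 100  # assumed per-request limit for this endpoint


def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_request(track_ids, access_token):
    """Build (but do not send) a GET request for one batch of track ids."""
    query = urllib.parse.urlencode({"ids": ",".join(track_ids)})
    return urllib.request.Request(
        f"{AUDIO_FEATURES_URL}?{query}",
        headers={"Authorization": f"Bearer {access_token}"},
    )


def fetch_audio_features(track_ids, access_token):
    """Send one request per batch and collect the returned feature objects."""
    features = []
    for batch in chunked(track_ids, MAX_IDS_PER_REQUEST):
        request = build_request(batch, access_token)
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
        # The response wraps the per-track objects in an "audio_features" list.
        features.extend(payload["audio_features"])
    return features
```

Calling fetch_audio_features(ids, token) requires network access and real credentials, so only the batching and request construction can be checked offline.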
Much of our research also focused on becoming more familiar with building models, in particular methods for tuning models, building neural networks, and clustering. Here, we found some articles on Medium to be extremely helpful, as well as some pages on StackOverflow (these are not listed since the list would be too long).
1. https://towardsdatascience.com/understanding-neural-networks-19020b758230
2. https://www.datarobot.com/wiki/tuning/
3. https://towardsdatascience.com/an-introduction-to-clustering-algorithms-in-python-123438574097
4. https://scikit-learn.org/stable/modules/clustering.html
5. https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Apart from these online resources, we also referenced class lectures, homework assignments, and labs.