Recommender systems influence much of our daily lives. Targeted ads use our web browsing patterns to show each of us the content most likely to appeal to us as individual consumers. Netflix keeps track of the shows and movies we watch in order to surface more individualized content that we might enjoy, which is why, after binging a whole season of The Great British Bake Off, you may find more cooking shows in your Netflix suggestions.
Music recommender systems work in much the same way, suggesting songs that resemble those already in a user's playlist. Building an effective music recommender system is difficult and requires a great deal of data: audio features and lyrical content carry a lot of information, and the reasons behind a playlist vary from user to user, yet songs themselves are quite short (on the order of 3-5 minutes, versus 90-minute movies or hours of web browsing). Additionally, the order in which songs are played matters, and the context in which a song is heard may change the user's preference as well.
Using Spotify's Million Playlist Dataset (found at https://recsys-challenge.spotify.com/ ), we set out to build a model that could recommend relevant songs based on the songs already in a playlist. This dataset consists of 1,000,000 playlists of varying length, created by different individuals and split across 1000 .csv files. Each row in the .csv files corresponds to a song and contains the ordinal position, within that .csv file, of the playlist to which the song belongs, the position of the song in that playlist, the artist, the song title, the album, the song duration in milliseconds, and the unique Spotify URIs of the song, artist, and album. Our first step was to combine the data into a single .csv file so that we could operate on the whole dataset at once rather than file by file, and we added a column recording which .csv file each playlist originally came from. The result is an enormous dataset with 66,346,428 rows. Because of its sheer size, we had problems with memory and storage throughout the project. Even on Google Colab, we consistently ran out of RAM while trying to perform computations on the whole dataset, so we ended up working with only half of it.
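As a rough illustration of the merging step, the sketch below streams rows from several .csv files into one output file and tags each row with the file it came from. It is not the code we ran: the function name `combine_csv_files` and the column name `source_file` are hypothetical, and it assumes every input file shares the same header. Because it writes row by row, it never holds the full dataset in memory.

```python
import csv
from pathlib import Path


def combine_csv_files(input_paths, output_path, source_column="source_file"):
    """Stream rows from many CSV files into one, tagging each row with its origin.

    Assumes all input files share the same header. Returns the number of
    data rows written.
    """
    rows_written = 0
    writer = None
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src)
                if reader.fieldnames is None:
                    continue  # skip completely empty files
                if writer is None:
                    # Build the output header once, from the first non-empty file.
                    header = list(reader.fieldnames) + [source_column]
                    writer = csv.DictWriter(out, fieldnames=header)
                    writer.writeheader()
                for row in reader:
                    # Record the originating file name (hypothetical column).
                    row[source_column] = Path(path).name
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Keeping the source file alongside each row matters because the playlist ordinal is only unique within a single .csv file, so the pair (source file, playlist ordinal) is what identifies a playlist after merging.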
Our initial exploration showed that the dataset is almost complete: no unique Spotify identifiers are missing, but a few fields are empty, most likely because Spotify's own song and artist attribution was incomplete. For example, a few classical songs and indie rap songs had “N/A” as the artist. Besides this genuine missingness, a few real artists used “NA” or “null” as their name, and there was an album called “N/A.” For matrix factorization, we need only a unique identifier for each song, and the Spotify song URI seems to serve that purpose best, so the missing artist names have little impact on this method. To handle the missingness anyway, we converted the genuinely missing N/A values to NaN, kept the real names as strings, and did not drop any of those rows.
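The distinction between a genuinely missing artist and a real name that merely looks like a missing value can be sketched as follows. This is a minimal illustration rather than our actual cleaning code: the column names, the sample rows, and the per-column sentinel table are all assumptions made up for the example, and `None` stands in for NaN.

```python
import csv
import io

# Assumption: per-column strings that mean "value missing". Only the artist
# column treats "N/A" as missing; an album titled "N/A" is a real title.
MISSING_SENTINELS = {"artist_name": {"N/A"}}


def clean_row(row, sentinels=MISSING_SENTINELS):
    """Replace column-specific missing markers with None; keep all else as strings."""
    cleaned = {}
    for column, value in row.items():
        if value in sentinels.get(column, set()):
            cleaned[column] = None
        else:
            cleaned[column] = value
    return cleaned


# Made-up sample rows; the URIs are placeholders, not real Spotify URIs.
sample = io.StringIO(
    "track_uri,artist_name,album_name\n"
    "uri:1,N/A,Some Album\n"
    "uri:2,NA,Another Album\n"
    "uri:3,null,Third Album\n"
    "uri:4,Some Artist,N/A\n"
)

# csv.DictReader reads every field as a plain string, so "NA" and "null"
# survive as names instead of being silently turned into missing values.
rows = [clean_row(r) for r in csv.DictReader(sample)]
```

If the data is loaded with pandas instead, note that `read_csv` by default treats strings such as "NA", "N/A", and "null" as missing; passing `keep_default_na=False` keeps them as strings so that the conversion can be done per column.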
We also found that playlists with fewer songs are more frequent than playlists with many songs (>100 songs)...
...and most songs in the playlists fall within 2 to 6 minutes long.
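Both summaries above can be computed with a few lines of code. The sketch below assumes each row is a dictionary with hypothetical keys `source_file`, `pid` (the playlist ordinal within its file), and `duration_ms`; the sample rows are invented for illustration and are not taken from the dataset.

```python
from collections import Counter


def playlist_lengths(rows):
    """Count songs per playlist, keyed by (source file, playlist ordinal)."""
    return Counter((r["source_file"], r["pid"]) for r in rows)


def share_in_duration_range(rows, low_min=2.0, high_min=6.0):
    """Fraction of rows whose song duration lies within [low_min, high_min] minutes."""
    minutes = [int(r["duration_ms"]) / 60000 for r in rows]
    in_range = sum(low_min <= m <= high_min for m in minutes)
    return in_range / len(minutes)


# Invented sample: one three-song playlist and one single-song playlist.
sample_rows = [
    {"source_file": "a.csv", "pid": "0", "duration_ms": "180000"},  # 3 min
    {"source_file": "a.csv", "pid": "0", "duration_ms": "240000"},  # 4 min
    {"source_file": "a.csv", "pid": "0", "duration_ms": "60000"},   # 1 min
    {"source_file": "b.csv", "pid": "0", "duration_ms": "420000"},  # 7 min
]

lengths = playlist_lengths(sample_rows)
share = share_in_duration_range(sample_rows)
```

Playlists are keyed by the pair of source file and ordinal because the same ordinal recurs in every .csv file.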
This is in line with average listening habits: most people don't make very long playlists, and most songs are 3 to 6 minutes long. We also found that certain artists are much more popular than others in the dataset. Drake appears at an impressive 20% frequency in a single playlist, and various other artists, such as Kendrick Lamar and Rihanna, appear at similar frequencies.