The Million Playlist Dataset (MPD) is a dataset of 1,000,000 playlists created by Spotify users; each record contains a playlist identifier as well as track listings and related artist and album data. We used each track's unique URI from this dataset to query the Spotify API and extract additional Spotify-generated audio features, from which we selected the following:
Missing Values
Since this was a curated dataset, there was no missing data.
Compiling Data
Given the very large size of our data, we chose to look at only 10 csv files (10,000 playlists) to make our EDA process more manageable and time efficient. We assumed that the playlists were distributed randomly across the csv files, so we selected the first 10 (songs0.csv - songs9.csv) as our sample. First, we imported and combined these 10 files of the MPD into one larger file and re-numbered the pids so that playlists are numbered sequentially across files. (In the original data, each csv file resets its pid value so that its first playlist has a value of 0. After re-numbering, the second csv file starts at pid 1000, the third at 2000, and so on.)
We labeled this new file sample_data.
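The compilation step above can be sketched as follows. This is a minimal sketch: the `pid` column name comes from the MPD, while the helper name `compile_sample` and the file paths are our own for illustration.

```python
import pandas as pd

def compile_sample(frames, playlists_per_file=1000):
    """Offset each file's locally-numbered pids so playlists are numbered
    sequentially across files, then concatenate into one DataFrame.
    frames: list of per-file DataFrames, in file order (songs0, songs1, ...)."""
    out = []
    for i, df in enumerate(frames):
        df = df.copy()
        df["pid"] = df["pid"] + playlists_per_file * i  # file i starts at pid 1000*i
        out.append(df)
    return pd.concat(out, ignore_index=True)

# e.g. sample_data = compile_sample([pd.read_csv(f"songs{i}.csv") for i in range(10)])
# sample_data.to_csv("sample_data.csv", index=False)
```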
We were interested in understanding some basic descriptors, such as how many songs were in our sample and how large our playlists were.
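These descriptors reduce to simple aggregations; a sketch assuming `sample_data` has one row per song entry with a `pid` column (as in the MPD):

```python
import pandas as pd

def basic_descriptors(sample_data):
    """Total number of song entries and the distribution of playlist sizes."""
    n_entries = len(sample_data)                        # songs in the sample
    playlist_sizes = sample_data.groupby("pid").size()  # songs per playlist
    return n_entries, playlist_sizes

# usage: n, sizes = basic_descriptors(sample_data); sizes.describe()
```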
We then took a closer look at the artists in our sample.
We found a total of 35,637 artists in our sample, far fewer than the total number of song entries (664,712). This means that certain artists appear multiple times in our sample and are likely to appear in more than one playlist. We were interested in quantifying how many times artists were repeated and how often this occurred. The histogram below shows that many artists appear more than once in our sample, suggesting either that an artist has more than one song featured in a single playlist or that one artist is featured in multiple playlists. Given that a playlist has an average of about 67 songs and that the number of times an artist's name is repeated often exceeds this average, we think the latter is more likely. Assuming that artists are part of multiple playlists, it follows that some playlists may be similar to each other, because artists often produce music in the same genre and with similar characteristics (we take a look at exactly what these characteristics are below).
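Both readings can be checked directly; a sketch assuming the `artist_name` and `pid` columns from the MPD schema (the helper name is ours):

```python
import pandas as pd

def artist_stats(sample_data):
    """Appearance counts per artist (the histogram's input) and the number
    of distinct playlists each artist appears in."""
    appearances = sample_data["artist_name"].value_counts()
    playlists_per_artist = sample_data.groupby("artist_name")["pid"].nunique()
    return appearances, playlists_per_artist
```

If `playlists_per_artist` tracks `appearances` closely, artists are spread across playlists rather than repeated within one, which is the reading we favor above.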
We also took a closer look at the tracks in our sample.
Of the 664,712 song entries in our sample, we found that 132,920 tracks were unique. This indicates that ~80% of the entries in our sample are repeat occurrences of tracks that appear in more than one playlist, which suggests that the tracks in our sample are for the most part widely popular and that the playlists in our sample may have shared characteristics (similar genre, mood, etc.).
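This repeat share is a two-line computation; a sketch assuming the `track_uri` column from the MPD schema:

```python
import pandas as pd

def track_overlap(sample_data):
    """Number of unique tracks and the share of entries that are repeats."""
    n_entries = len(sample_data)
    n_unique = sample_data["track_uri"].nunique()
    repeat_share = 1 - n_unique / n_entries  # ~0.80 in our sample
    return n_unique, repeat_share
```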
To better understand the overlap in the songs, we display the histogram below, which shows that songs in our sample are often featured in more than one playlist. Among repeated songs, the majority appear twice, while the most-repeated song appears 761 times (this song is Closer). This justifies the approach of our baseline models, which recommend a song from one playlist to add to another, since this analysis confirms that playlists are likely to share songs.
Given that roughly 80% of the track entries in our sample are repeats, we considered cleaning our data by eliminating tracks that appear fewer than some threshold number of times. We believed doing this would eliminate very niche tracks from our sample dataset. However, since we are already working with a small sample of our initial dataset, even eliminating tracks that appear only once would delete a significant chunk of our data (~12% of entries). Therefore, instead of eliminating niche tracks from our sample dataset, we will prevent our model from recommending niche songs (which risk not suiting a user's preferences) by increasing the probability cut-off of our baseline models.
In the table to the left, the left column indicates the number of times a track appeared in our sample and the right column indicates how many tracks appeared that number of times. For example, 78,248 tracks appear once, 18,861 tracks appear twice, and so forth.
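The table is a frequency-of-frequencies: count each track's appearances, then count how many tracks share each appearance count. A sketch, again assuming the `track_uri` column:

```python
import pandas as pd

def appearance_table(sample_data):
    """For each appearance count k, how many tracks appear exactly k times
    (the two columns of the table)."""
    per_track = sample_data["track_uri"].value_counts()
    return per_track.value_counts().sort_index()
```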
We started with examining the distribution of features in our subset of data. Here, there is an interesting variety of skewness in the variables.
To get a better sense of our data, we also looked at pairwise relationships between our features. While we only had 13 features for prediction, we still wanted to check how related they might be or perhaps identify the most helpful predictors. We did not intend to perform feature reduction.
For the most part, there did not seem to be strong relationships between our 13 features. However, we did identify a relatively strong positive correlation between loudness and energy, as well as a slight positive correlation between valence and danceability.
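The pairwise check reduces to a correlation matrix; a sketch that lists only the four features discussed here (the column names are assumed to follow Spotify's audio-feature naming):

```python
import pandas as pd

# Subset of our 13 predictors; names assumed from Spotify's audio features.
FEATURES = ["loudness", "energy", "valence", "danceability"]

def feature_correlations(sample_data, features=FEATURES):
    """Pairwise Pearson correlations between audio features."""
    return sample_data[features].corr()
```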
We further explored some of the feature attributes that had discrete values to more specifically understand how one feature (and the different values that this feature takes on) relates to the others. We looked at time_signature and mode.
Given the skew observed in the data, we decided to color the pairplot by time_signature, the only discrete variable with fewer than 10 levels. We now see distinct distributions among all predictors.
We suspected that mode (the other predictor with fewer than 10 levels) would not be as interesting a predictor to facet the density plots by. We expected similar distributions for most audio features, since a song's modality (whether it is major or minor) is too broad a characterization and does not prevent an artist from, say, making a song louder or more acoustic; this is confirmed below.
We then looked at how these distributions change when songs are grouped by their playlists. Now we see the most variation in danceability, energy, key, and loudness among the playlists. This makes sense, since we also naturally divide songs into playlists by similar parameters for different purposes (focused studying, relaxing at night after a stressful week, celebrating the end of finals, etc.). The playlists in this subset were fairly similarly distributed in most of the other features, save for a single outlier in each of acousticness, liveness, valence, instrumentalness, and tempo.
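Which features vary most between playlists can be summarized by comparing per-playlist means; a sketch assuming the `pid` column and Spotify-style feature columns (the helper name is ours):

```python
import pandas as pd

def playlist_feature_spread(sample_data, features):
    """Mean of each feature per playlist; the std of these means across
    playlists shows which features differ most between playlists."""
    per_playlist = sample_data.groupby("pid")[features].mean()
    return per_playlist.std().sort_values(ascending=False)
```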
We also take a look at the attributes of songs in a single playlist. The playlist below contains 19 songs, each of which is represented as a blue dot on the plot. We find that the songs in playlist 1449 have quite distinct attributes. While the values for acousticness, instrumentalness, liveness, tempo, speechiness, and time signature span a small range, the values for the other attributes are widely dispersed. The table below also confirms this difference in the distribution of values for each feature. This is somewhat unexpected, since we would expect a playlist to consist of songs with similar characteristics in most dimensions. However, this does not impact our approach, since we are more interested in using the similarities between a playlist's overall attributes to recommend songs, not similarities with specific songs in a playlist.
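The per-playlist table reduces to filtering on `pid` and summarizing; a hedged sketch (the helper name is ours, and the feature columns are assumed from the Spotify schema):

```python
import pandas as pd

def playlist_summary(sample_data, pid, features):
    """Per-feature summary statistics for the songs in one playlist
    (e.g. the table shown for playlist 1449)."""
    songs = sample_data[sample_data["pid"] == pid]
    return songs[features].describe()
```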
Through our EDA we were able to gain insight into our Spotify data and to uncover relationships between the track attributes of our playlists, which we will use to predict songs in our models. We were also able to test our assumptions that playlists can be defined by their distinct feature attributes and to thereby justify our approach of using the values for these attributes as indicators of whether a song ought to belong to a given playlist. However, we also note that there seem to be some similarities across playlists since some of the distributions of feature attributes were similar for the 20 playlists we examined in our EDA. This justifies our approach of using clustering to improve our models.