To begin the project, we performed some exploratory data analysis. In the interest of time and resources, we analyzed two datasets: the Million Playlist Dataset and the data available from the Spotify API.
For the Million Playlist Dataset, we visualized two things: the distribution of song lengths, and the distribution of the number of songs in each playlist. To keep the computation manageable, we restricted the analysis to the first 6 CSV files.
We can see that while the overall distribution of song lengths is highly similar across the sample CSVs, each appears to contain at least one exceptionally long song. For instance, in sample 4, the longest song is over 5,000,000 milliseconds = 5,000 seconds ≈ 83.3 minutes. However, this is an outlier, and judging from the violin plots, such outliers are minuscule in number and will probably not affect the resulting model substantially.
In the left plot, we can see that for each CSV file, the majority of playlists have fewer than 75 songs, and no playlist has more than 250 songs. The mean number of songs per playlist is depicted with a red vertical line in each plot. From this, we can see that the mean, approximately 60, is consistent across the sample CSV files.
The code for this EDA process can be viewed here.
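The per-slice summary statistics described above can be sketched as follows. This is a minimal illustration, not the notebook linked above; the column names `pid` (playlist id) and `duration_ms` (track length) are assumptions about how a slice CSV might be laid out.

```python
import pandas as pd

def playlist_summary(df: pd.DataFrame) -> dict:
    """Summarize one Million Playlist Dataset slice.

    Assumes hypothetical columns `pid` (playlist id) and
    `duration_ms` (track length in milliseconds).
    """
    songs_per_playlist = df.groupby("pid").size()
    return {
        "mean_songs": songs_per_playlist.mean(),      # red vertical line in the plots
        "max_songs": int(songs_per_playlist.max()),   # check: should stay under 250
        "longest_track_min": df["duration_ms"].max() / 60_000,  # flags outlier songs
    }
```

Applying this to each of the six slice CSVs (e.g. via `pd.read_csv`) reproduces the kind of numbers discussed above: a mean near 60 songs per playlist, and the occasional 80-minute track.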
The following audio features were downloaded via the Spotify API for the songs on the first 100 playlists in the Million Playlist Dataset, 5,387 songs in all. These features were used as inputs to the artificial neural network we developed, which maps Spotify audio features to Last.fm song character tags. The following 13 Spotify-defined features were used: acousticness, danceability, speechiness, instrumentalness, liveness, energy, valence, duration_ms, key, mode, loudness, tempo, and time_signature. Below are histograms for each of these features. As is evident, many features are defined on a 0 to 1 scale, which is well suited for input to the neural network. Some of the features' distributions are fairly symmetric, danceability and valence for example. Others, such as liveness and acousticness, have skewed distributions that may perform better if they are log, -log, or sqrt transformed before input. Also evident: about a third of the songs are in a minor key and two-thirds are in a major key, so both modes are well represented, as are the twelve musical keys themselves (C, C#, D, D#, E, F, F#, G, G#, A, A#, B), although there is a clear preference for the key of C. Most songs are between 3 and 5 minutes long and have tempos between 100 and 150 beats per minute.
Fig 2: Histograms of the audio features of the songs on the first 100 Spotify playlists. These features served as the input to our neural network. The features were extracted from the Spotify API at: developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/
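The candidate transforms for the skewed features can be sketched as below. Which transform (if any) actually helps the network is an empirical question we have not settled here; this only shows how the skewed 0-to-1 features would be compressed before input.

```python
import numpy as np

def sqrt_transform(x: np.ndarray) -> np.ndarray:
    """Mildly compress a right-skewed feature on [0, 1]."""
    return np.sqrt(x)

def log_transform(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """More aggressive compression for heavily skewed features
    such as liveness or acousticness.

    `eps` guards against log(0), since some songs have a feature
    value of exactly 0 (e.g. instrumentalness for vocal tracks).
    """
    return np.log(x + eps)
```

Both transforms are monotone, so they reshape a feature's histogram without reordering the songs; the -log variant mentioned above is simply the negation of `log_transform`.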
For this EDA, the code can be viewed here.
Tags and their associated weights were downloaded via the Last.fm API for the same songs for which we downloaded the Spotify audio features. The weight for each tag ranges from 0 to 100. We selected only the tags with more than 3,000 occurrences in the last.fm_unique_tags file provided by Last.fm, so that each selected tag is associated with a sufficient number of songs. This left 274 tags in all, and the weights of these tags were used as the multi-label outputs of the artificial neural network we developed.
For this EDA, the code can be viewed here.
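The tag-selection and multi-label target construction described above can be sketched as follows. This is an illustrative outline, not the linked code: the `dict`-of-counts input and the per-song `{tag: weight}` dicts are assumed structures standing in for the parsed last.fm_unique_tags file and the Last.fm API responses.

```python
import numpy as np

def select_frequent_tags(tag_counts: dict, min_count: int = 3000) -> list:
    """Keep only tags with more than `min_count` occurrences.

    `tag_counts` maps tag name -> occurrence count (a hypothetical
    structure; the real last.fm_unique_tags file is plain text).
    """
    return sorted(t for t, c in tag_counts.items() if c > min_count)

def tag_weight_matrix(song_tags: list, vocab: list) -> np.ndarray:
    """Build the multi-label target matrix for the network.

    `song_tags` is a list of {tag: weight} dicts (weights 0-100),
    one per song. Returns one row per song and one column per
    selected tag, with weights rescaled to [0, 1]; tags outside
    the vocabulary are dropped.
    """
    idx = {t: j for j, t in enumerate(vocab)}
    m = np.zeros((len(song_tags), len(vocab)))
    for i, tags in enumerate(song_tags):
        for t, w in tags.items():
            if t in idx:
                m[i, idx[t]] = w / 100.0
    return m
```

With the real data, the vocabulary would have 274 entries and the matrix one row for each of the 5,387 songs.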