Our data consists of the Million Playlist Dataset (MPD) and the track audio features data from Spotify. The MPD contains one million playlists recorded by Spotify, with track names, artist names, album names, the URI of each track, and track durations. These playlists were randomly selected from all playlists that met the following criteria:
The original MPD is large (up to 30 GB), so given our limited computing power and memory we used 10% of the dataset (100,000 playlists) for analysis. We retrieved the audio features of each track using its URI and kept the relevant ones.
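As a rough illustration of how playlist data of this kind can be turned into one row per track, the sketch below flattens a parsed slice into a dataframe. It assumes each slice is a dictionary with a "playlists" list whose entries carry a "pid" and a "tracks" list; the function name `flatten_slice`, the column order and the tiny example are ours, not real MPD data or the code used for this analysis.

```python
import pandas as pd

# Column order is an assumption, chosen to match the nine fields described in the text.
COLUMNS = ["pid", "pos", "artist_name", "track_name", "album_name",
           "artist_uri", "track_uri", "album_uri", "duration_ms"]


def flatten_slice(slice_dict):
    """Turn one parsed slice (a dict with a "playlists" list) into one row per track."""
    rows = []
    for playlist in slice_dict["playlists"]:
        for track in playlist["tracks"]:
            # Copy the track-level fields, then attach the playlist id.
            row = {col: track.get(col) for col in COLUMNS if col != "pid"}
            row["pid"] = playlist["pid"]
            rows.append(row)
    return pd.DataFrame(rows, columns=COLUMNS)


# Hand-made example with placeholder values (not real MPD content).
example_slice = {
    "playlists": [
        {"pid": 0, "tracks": [
            {"pos": 0, "artist_name": "A", "track_name": "T1", "album_name": "B1",
             "artist_uri": "a1", "track_uri": "t1", "album_uri": "b1", "duration_ms": 1000},
            {"pos": 1, "artist_name": "A", "track_name": "T2", "album_name": "B1",
             "artist_uri": "a1", "track_uri": "t2", "album_uri": "b1", "duration_ms": 2000},
        ]},
        {"pid": 1, "tracks": [
            {"pos": 0, "artist_name": "C", "track_name": "T3", "album_name": "B2",
             "artist_uri": "a2", "track_uri": "t3", "album_uri": "b2", "duration_ms": 3000},
        ]},
    ]
}

tracks_df = flatten_slice(example_slice)
print(tracks_df.shape)
```

Processing the slices one at a time and concatenating the results keeps memory use bounded, which matters when the full dataset does not fit in memory.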
Here is an overview of the dataset's structure:
In the transformed dataset:
pid: the id of the playlist (0-999).
pos: the position of the track in the playlist.
artist_name, track_name and album_name: the names of the artist, track and album.
artist_uri, track_uri and album_uri: the Spotify URIs that uniquely identify the artist, track and album.
duration_ms: the length of the track in milliseconds.

This dataframe has 6,677,800 rows and 9 columns. However, we found that some songs appear in the same playlist multiple times (with identical URIs). After excluding these duplicates, there were 6,589,079 valid track records, 3,812,169 valid artist records and 4,978,987 valid album records.
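The duplicate removal described above can be sketched as follows. We assume here that a "valid record" is a unique (playlist, URI) pair at the track, artist or album level; that reading, the function name `count_valid_records` and the toy dataframe are illustrative assumptions rather than the original code.

```python
import pandas as pd


def count_valid_records(tracks_df):
    """Count unique (pid, uri) pairs for tracks, artists and albums."""
    counts = {}
    for level in ("track", "artist", "album"):
        uri_col = f"{level}_uri"
        # A URI repeated inside the same playlist is counted only once.
        counts[level] = len(tracks_df.drop_duplicates(subset=["pid", uri_col]))
    return counts


# Toy data: track t1 appears twice in playlist 0.
toy_df = pd.DataFrame({
    "pid":        [0, 0, 0, 1],
    "track_uri":  ["t1", "t1", "t2", "t1"],
    "artist_uri": ["a1", "a1", "a1", "a1"],
    "album_uri":  ["b1", "b1", "b2", "b1"],
})

valid_counts = count_valid_records(toy_df)
print(valid_counts)
```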
We then calculated how many tracks/artists/albums are there per playlist and inspected the distribution.
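A minimal sketch of this per-playlist count, assuming the column names listed earlier; the helper name `per_playlist_counts` and the toy data are hypothetical.

```python
import pandas as pd


def per_playlist_counts(tracks_df):
    """Number of distinct tracks, artists and albums in each playlist."""
    counts = tracks_df.groupby("pid")[["track_uri", "artist_uri", "album_uri"]].nunique()
    return counts.rename(columns={
        "track_uri": "n_tracks",
        "artist_uri": "n_artists",
        "album_uri": "n_albums",
    })


# Toy data: playlist 0 repeats track t1, playlist 1 has a single track.
toy_tracks = pd.DataFrame({
    "pid":        [0, 0, 0, 1],
    "track_uri":  ["t1", "t1", "t2", "t3"],
    "artist_uri": ["a1", "a1", "a1", "a2"],
    "album_uri":  ["b1", "b1", "b2", "b3"],
})

playlist_counts = per_playlist_counts(toy_tracks)
print(playlist_counts)
```

Because `nunique` counts distinct URIs, repeated tracks inside a playlist do not inflate the counts.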
Here is an overview of the dataset's structure:
Here are the distributions of these variables:
All of these distributions are right-skewed. Without the "250 tracks" restriction, the distributions might have even longer tails.
Using URIs to distinguish items that share the same name, we found 681,805 unique tracks, 110,063 unique artists and 271,413 unique albums among these 100,000 playlists. We also wanted to know how popular these songs and artists are. We measured popularity as the frequency of appearance across the 100,000 playlists.
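One way to compute such a frequency is to count, for each URI, the number of playlists that contain it at least once. The sketch below takes that interpretation, which is an assumption on our part; `playlist_frequency` and the toy data are illustrative names and values.

```python
import pandas as pd


def playlist_frequency(tracks_df, uri_col):
    """Number of playlists in which each URI appears at least once."""
    unique_pairs = tracks_df.drop_duplicates(subset=["pid", uri_col])
    return unique_pairs[uri_col].value_counts()


# Toy data: t1 is in three playlists (and repeated inside playlist 0).
toy_popularity = pd.DataFrame({
    "pid":        [0, 0, 0, 1, 1, 2],
    "track_uri":  ["t1", "t1", "t2", "t1", "t2", "t1"],
    "artist_uri": ["a1", "a1", "a2", "a1", "a2", "a1"],
})

track_freq = playlist_frequency(toy_popularity, "track_uri")
artist_freq = playlist_frequency(toy_popularity, "artist_uri")
top_tracks = track_freq.head(50)  # value_counts sorts in descending order
print(top_tracks)
```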
Here are the TOP 50 popular songs and artists:
The TOP 50 songs are performed by 40 different artists, 26 of whom are TOP 50 artists, and 36 of the TOP 50 songs are performed by TOP 50 artists. The popularity of songs and the popularity of artists therefore appear to be closely related.
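The three overlap figures above can be derived along the following lines. This is a sketch under the assumptions that popularity is the per-playlist frequency and that each track maps to a single artist URI; `top_overlap` and the toy data are hypothetical.

```python
import pandas as pd


def top_overlap(tracks_df, n=50):
    """Compare the n most frequent tracks with the n most frequent artists."""
    track_counts = tracks_df.drop_duplicates(subset=["pid", "track_uri"])["track_uri"].value_counts()
    artist_counts = tracks_df.drop_duplicates(subset=["pid", "artist_uri"])["artist_uri"].value_counts()
    top_tracks = track_counts.head(n).index
    top_artists = set(artist_counts.head(n).index)

    # Look up the artist of each top track (assumes one artist URI per track).
    track_to_artist = tracks_df.drop_duplicates(subset="track_uri").set_index("track_uri")["artist_uri"]
    artists_of_top_tracks = track_to_artist.loc[top_tracks]

    return {
        "distinct_artists": int(artists_of_top_tracks.nunique()),
        "artists_also_top": len(set(artists_of_top_tracks) & top_artists),
        "tracks_by_top_artists": int(artists_of_top_tracks.isin(top_artists).sum()),
    }


# Toy data: a3 is a frequent artist whose individual tracks are each rare.
toy_overlap = pd.DataFrame({
    "pid":        [0, 0, 1, 1, 2, 2, 3, 3],
    "track_uri":  ["t1", "t3", "t1", "t4", "t1", "t5", "t3", "t6"],
    "artist_uri": ["a1", "a2", "a1", "a3", "a1", "a3", "a2", "a3"],
})

overlap = top_overlap(toy_overlap, n=2)
print(overlap)
```

In the toy data the two most frequent tracks come from two artists, only one of whom is also among the two most frequent artists, which mirrors the kind of comparison made in the text.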
For these 681,805 unique tracks, we tried to match Spotify's track features data obtained through audio analysis. We found 681,787 matches, which is over 99.997% of the unique tracks in our 10% MPD sample. Here is an overview of the features data.
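A match rate like this can be checked with a left join and a merge indicator. The sketch below assumes that the track URIs have the form "spotify:track:<id>", that the features table is keyed by the bare id in a `track_id` column, and that each id appears at most once in the features table; the function name `match_rate` and the sample values are illustrative.

```python
import pandas as pd


def match_rate(unique_track_uris, features_df):
    """Share of unique tracks that have a row in the features table."""
    tracks = pd.DataFrame({"track_uri": list(unique_track_uris)})
    # Assumption: stripping the URI prefix yields the id used in the features table.
    tracks["track_id"] = tracks["track_uri"].str.replace("spotify:track:", "", regex=False)
    merged = tracks.merge(features_df, on="track_id", how="left", indicator=True)
    matched = int((merged["_merge"] == "both").sum())
    return matched, matched / len(tracks)


# Toy data: features are available for two of the three tracks.
toy_uris = ["spotify:track:x1", "spotify:track:x2", "spotify:track:x3"]
toy_features = pd.DataFrame({
    "track_id": ["x1", "x3"],
    "danceability": [0.5, 0.8],
})

n_matched, rate = match_rate(toy_uris, toy_features)
print(n_matched, round(rate, 3))
```

Applied to the figures in the text, the same ratio is 681,787 / 681,805, which is approximately 0.99997.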
In this dataset:
track_id: the same track identifier as before.
danceability: a measure of how suitable a track is for dancing. This measure comes from a combination of musical characteristics, such as tempo, rhythm stability and beat strength. 0 is least danceable and 1 is most danceable.
energy: a measure of intensity and activity. Tracks with high energy are usually fast, loud and noisy. 0 is least energetic and 1 is most energetic.
key: an indicator of the estimated overall key of the track. The integers map to pitches using standard pitch class notation. If no key was detected, the value is -1.
loudness: the average loudness of a track in decibels (dB). Typical values range from -60 to 0.
mode: an indicator of the modality of a track based on its melodic content. 1 is major and 0 is minor.
speechiness: a measure of the presence of spoken words in a track. Values range from 0 to 1. A value above 0.66 means the track probably consists mostly of spoken words, like a talk show or audio book. A value from 0.33 to 0.66 means the track could contain both music and words, like rap. A value below 0.33 means the track is mainly music without speech.
acousticness: a confidence measure of whether a track is acoustic. Values range from 0 to 1, where 1 means high confidence that the track is acoustic.
instrumentalness: a measure of whether a track contains vocals. Values range from 0 to 1. The closer the value is to 1, the higher the confidence that the track contains no vocal content.
liveness: a measure of the presence of an audience. Values range from 0 to 1. A higher value means higher confidence that the track was performed live.
valence: a measure of positiveness. Values range from 0 to 1. A high value means the track sounds more positive (e.g. happy, cheerful). A low value means the track sounds more negative (e.g. sad, unhappy).
tempo: a measure of speed or pace in musical terminology. The value is the overall estimated tempo of a track in beats per minute (BPM).

Here are the summary statistics of these features:
Here are histograms of these features:
[Histograms: Danceability, Energy, Key, Loudness, Mode, Speechiness, Acousticness, Instrumentalness, Liveness, Valence, Tempo]
These histograms are similar to the examples given by Spotify, which suggests that our sample of tracks is representative.
It is reasonable to expect that some features are correlated with others. We explored the correlations between the features and plotted the correlation matrix. [1]
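The correlation matrix itself can be computed directly from the features dataframe. The sketch below uses synthetic data in place of the real features, so the column relationships are invented purely for illustration; the helper name `strongest_pairs` is also ours.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the features table (not real Spotify data).
rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, size=500)
synthetic_features = pd.DataFrame({
    "energy": energy,
    "loudness": -60 + 60 * energy + rng.normal(0, 2, size=500),  # built to follow energy
    "valence": rng.uniform(0, 1, size=500),                       # independent of the rest
})

# Pearson correlation between every pair of numeric columns.
corr = synthetic_features.corr()


def strongest_pairs(corr_matrix):
    """List feature pairs ordered by absolute correlation, each pair once."""
    upper = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    pairs = corr_matrix.where(upper).stack()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.loc[order]


ranked_pairs = strongest_pairs(corr)
print(ranked_pairs)
```

Ranking the pairs by absolute value gives the same ordering that the square sizes convey in a heatmap of this kind.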
The size of each square is proportional to the absolute value of the correlation. From this plot we found that
[1] https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec