Exploratory Data Analysis

Data Source

Our data consists of the Million Playlist Dataset (MPD) and the track audio features data from Spotify. The MPD contains one million playlists recorded by Spotify, with track names, artist names, album names, the corresponding URI for each track, and track durations. These playlists were randomly selected from all playlists that met the following criteria:

was public when the MPD was generated
was created after January 1, 2010 and before December 1, 2017
contains at least 5 tracks
contains no more than 250 tracks
contains at least 3 unique artists
contains at least 2 unique albums
contains only tracks that are in Spotify's database
has at least one follower, not including the creator
meets other ethics-related restrictions

The original MPD is large (up to 30GB), so given our limited computing power and memory we used 10% of the dataset (100,000 playlists) for analysis. We then retrieved the audio features for each track using its track URI and kept the features that seemed reasonable for our analysis.

Data overview and analysis

MPD

First we focus on the MPD. The original MPD consists of 1000 csv files, each recording 1000 playlists. We selected the first 100 csv files and combined them into a separate dataset (10% MPD).
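The combination step can be sketched roughly as below. This is only an illustration using the standard library: the function name `combine_slices` and the tiny in-memory "files" are hypothetical stand-ins, and the real slices would be opened from disk instead.

```python
import csv
import io
from itertools import islice


def combine_slices(slice_files, n_slices):
    """Concatenate the rows of the first `n_slices` CSV streams.

    `slice_files` is an iterable of open text streams, each with a header row.
    Returns a list of dicts, one per track record.
    """
    rows = []
    for stream in islice(slice_files, n_slices):
        rows.extend(csv.DictReader(stream))
    return rows


# Tiny in-memory stand-ins for three slice files (made-up values).
slice_a = io.StringIO("pid,pos,track_uri\n0,0,spotify:track:a\n0,1,spotify:track:b\n")
slice_b = io.StringIO("pid,pos,track_uri\n1,0,spotify:track:a\n")
slice_c = io.StringIO("pid,pos,track_uri\n2,0,spotify:track:c\n")

# Keep only the first two slices, analogous to keeping the first 100 of 1000.
combined = combine_slices([slice_a, slice_b, slice_c], n_slices=2)
print(len(combined))  # 3 rows: two from the first slice, one from the second
```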

Here is an overview of the dataset's structure:

In the transformed dataset:

pid is the id of the playlist (0-999);
pos is the position of the track in the playlist;
artist_name, track_name and album_name are the names of the artist, track and album;
artist_uri, track_uri and album_uri are the Spotify URIs that uniquely identify the artist, track and album;
duration_ms is the length of the track in milliseconds.

This dataframe has 6,677,800 rows and 9 columns. However, we found that some songs appear multiple times in the same playlist (with identical URIs). After excluding these duplicates, 6,589,079 valid track records, 3,812,169 valid artist records and 4,978,987 valid album records remained.
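One possible reading of this step is that a "valid record" is a distinct (playlist, item) pair. The sketch below counts such pairs under that assumption, and additionally assumes that pid identifies a playlist uniquely in the combined data. The helper name `count_unique_pairs` and the sample rows are made up for illustration.

```python
def count_unique_pairs(rows, key):
    """Count distinct (pid, rows[key]) pairs, i.e. the records left after
    dropping repeats of the same item within one playlist."""
    return len({(row["pid"], row[key]) for row in rows})


# Made-up rows: playlist 0 contains track t1 twice.
rows = [
    {"pid": 0, "track_uri": "t1", "artist_uri": "a1", "album_uri": "b1"},
    {"pid": 0, "track_uri": "t1", "artist_uri": "a1", "album_uri": "b1"},
    {"pid": 0, "track_uri": "t2", "artist_uri": "a1", "album_uri": "b2"},
    {"pid": 1, "track_uri": "t1", "artist_uri": "a1", "album_uri": "b1"},
]

valid_tracks = count_unique_pairs(rows, "track_uri")    # 3
valid_artists = count_unique_pairs(rows, "artist_uri")  # 2
valid_albums = count_unique_pairs(rows, "album_uri")    # 3
```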

We then calculated the number of tracks, artists and albums per playlist and inspected their distributions.
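A minimal sketch of the per-playlist counts, assuming each row carries a pid and the relevant URI column. The function name `per_playlist_unique_counts` and the sample rows are hypothetical.

```python
from collections import defaultdict


def per_playlist_unique_counts(rows, key):
    """Map each pid to the number of distinct values of `key` in that playlist."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["pid"]].add(row[key])
    return {pid: len(items) for pid, items in seen.items()}


# Made-up rows: playlist 0 repeats track t2.
rows = [
    {"pid": 0, "track_uri": "t1", "artist_uri": "a1"},
    {"pid": 0, "track_uri": "t2", "artist_uri": "a1"},
    {"pid": 0, "track_uri": "t2", "artist_uri": "a1"},
    {"pid": 1, "track_uri": "t3", "artist_uri": "a2"},
]

tracks_per_playlist = per_playlist_unique_counts(rows, "track_uri")    # {0: 2, 1: 1}
artists_per_playlist = per_playlist_unique_counts(rows, "artist_uri")  # {0: 1, 1: 1}
```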

Here is an overview of the dataset's structure:

Here are the distributions of these variables:

All of these distributions are right-skewed. Without the "no more than 250 tracks" restriction, the distributions might have even longer tails.
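Right skew can also be checked numerically rather than by eye. The sketch below computes the moment-based sample skewness, g1 = m3 / m2^1.5, where a positive value indicates a longer right tail, and compares the mean with the median. The sample values are invented and are not taken from the MPD.

```python
from statistics import fmean, median


def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5.

    Positive for right-skewed data. Undefined if all values are equal.
    """
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Invented track counts with a few long playlists pulling the tail right.
tracks_per_playlist = [5, 8, 10, 12, 15, 20, 30, 60, 120, 250]

g1 = skewness(tracks_per_playlist)
mean_above_median = fmean(tracks_per_playlist) > median(tracks_per_playlist)
```

Here `g1` is positive and the mean (53) lies well above the median (17.5), both of which are typical signs of a right-skewed distribution.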

Using URIs rather than names, since different items can share the same name, we found 681,805 unique tracks, 110,063 unique artists and 271,413 unique albums among these 100,000 playlists. We also wanted to know how popular these songs and artists are. We measured popularity by how frequently each one appears among the 100,000 playlists.
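The popularity measure can be sketched as below. It assumes that an item is counted once per playlist it appears in, which is one plausible reading of "frequency among the playlists". The function name `popularity` and the sample rows are hypothetical.

```python
from collections import Counter


def popularity(rows, key):
    """Count, for each value of `key`, the number of playlists containing it."""
    pairs = {(row["pid"], row[key]) for row in rows}  # one count per playlist
    return Counter(item for _, item in pairs)


# Made-up rows: t1 is in three playlists (twice in playlist 0), t2 in two, t3 in one.
rows = [
    {"pid": 0, "track_uri": "t1"},
    {"pid": 0, "track_uri": "t1"},
    {"pid": 0, "track_uri": "t2"},
    {"pid": 1, "track_uri": "t1"},
    {"pid": 1, "track_uri": "t2"},
    {"pid": 2, "track_uri": "t1"},
    {"pid": 2, "track_uri": "t3"},
]

top_tracks = popularity(rows, "track_uri").most_common(2)
print(top_tracks)  # [('t1', 3), ('t2', 2)]
```

Replacing `most_common(2)` with `most_common(50)` would give a top 50 list in the same way.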

Here are the top 50 most popular songs and artists:

The top 50 songs are performed by 40 different artists, 26 of whom are also top 50 artists. 36 of the top 50 songs are performed by top 50 artists. The popularity of songs and the popularity of artists seem highly correlated.
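The three overlap figures above can be derived from a top songs list and a top artists set. The sketch below shows one way to do so on invented data. The function name `overlap_summary` and the input layout are assumptions.

```python
def overlap_summary(top_songs, top_artists):
    """Summarise how the top songs relate to the top artists.

    top_songs: list of (track_uri, artist_uri) pairs.
    top_artists: set of artist_uri values.
    """
    song_artists = {artist for _, artist in top_songs}
    return {
        "distinct_artists": len(song_artists),
        "artists_also_top": len(song_artists & top_artists),
        "songs_by_top_artists": sum(artist in top_artists for _, artist in top_songs),
    }


# Invented example: four top songs by three artists, two of whom are top artists.
top_songs = [("t1", "a1"), ("t2", "a1"), ("t3", "a2"), ("t4", "a3")]
top_artists = {"a1", "a3", "a9"}

summary = overlap_summary(top_songs, top_artists)
print(summary)
# {'distinct_artists': 3, 'artists_also_top': 2, 'songs_by_top_artists': 3}
```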

Features of Tracks

For these 681,805 unique tracks, we tried to match Spotify's track features data, which is derived from audio analysis. We found 681,787 matches, which is over 99.997% of the unique tracks in the 10% MPD. Here is an overview of the features data.

In this dataset:

track_id is the same track identifier as before.
danceability is a measure of how suitable a track is for dancing. This measure comes from a combination of musical characteristics, like tempo, rhythm stability and beat strength. 0 is least danceable and 1 is most danceable.
energy is a measure of intensity and activity. Tracks with high energy are usually fast, loud and noisy. 0 is least energetic and 1 is most energetic.
key is an indicator of the estimated overall key of the track. The integers map to pitches using standard pitch class notation. If no key was detected, the value is -1.
loudness is the average loudness of a track in decibels (dB). Typical values range from -60 to 0.
mode is an indicator of the modality of a track based on its melodic content. 1 is major and 0 is minor.
speechiness is a measure of spoken words in a track. Value ranges from 0 to 1. A value above 0.66 means the track probably consists almost entirely of spoken words, like a talk show or audio book. A value from 0.33 to 0.66 means the track may contain both music and speech, like rap. A value below 0.33 means the track is mainly music without speech.
acousticness is a confidence measure of whether a track is acoustic. Value ranges from 0 to 1. 1 means high confidence that the track is acoustic.
instrumentalness is a measure of whether a track contains no vocals. Value ranges from 0 to 1. The closer the value is to 1, the higher the confidence that the track contains no vocal content.
liveness is a measure of the presence of an audience in the recording. Value ranges from 0 to 1. A higher value means higher confidence that the track was performed live.
valence is a measure of positiveness. Value ranges from 0 to 1. A high value means the track sounds more positive (e.g. happy, cheerful, etc.). A low value means the track sounds more negative (e.g. sad, unhappy, etc.).
tempo is a measure of speed or pace in musical terminology. The value is the overall estimated tempo of a track in beats per minute (BPM).
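To make the encodings above concrete, the sketch below turns key, mode and speechiness values into readable labels. The helper names and label strings are hypothetical. The pitch names follow standard pitch class notation (0 = C, 1 = C#/Db, and so on), and the speechiness bands use the 0.33 and 0.66 thresholds described above.

```python
# Standard pitch class notation: index 0 is C, index 11 is B.
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]


def key_name(key, mode):
    """Readable key label, e.g. (9, 0) -> 'A minor'; -1 means no key detected."""
    if key == -1:
        return "unknown"
    return f"{PITCH_CLASSES[key]} {'major' if mode == 1 else 'minor'}"


def speechiness_band(value):
    """Coarse interpretation of a speechiness score in [0, 1]."""
    if value > 0.66:
        return "mostly speech"
    if value >= 0.33:
        return "music and speech"
    return "mostly music"


print(key_name(0, 1))          # C major
print(key_name(9, 0))          # A minor
print(speechiness_band(0.05))  # mostly music
```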

Here are the summary statistics of these features:

Here are histograms of these features: danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence and tempo.

These histograms are similar to the examples given by Spotify, which suggests that our sample of tracks is representative.

It is reasonable to expect that some features are correlated with others. We explored the correlations between the features and plotted the correlation matrix. [1]

The size of each square is proportional to the absolute value of the correlation. From this plot we found that:

Acousticness is negatively correlated with energy and loudness. This is reasonable since non-acoustic music is usually processed (and amplified) electronically.
Loudness and energy are highly positively correlated. This is reasonable since energy is measured partially based on loudness.
Loudness is negatively correlated with instrumentalness. This suggests that the loudness of a track comes mainly from vocal content, e.g. singing and shouting, while in most tracks the instrumental part serves as background music.
Valence is positively correlated with danceability, energy and loudness. This is reasonable since fast-paced, loud music usually expresses a positive mood, and people tend to dance more when they are in a positive mood.
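The correlations discussed above are pairwise coefficients between feature columns. The sketch below computes Pearson correlations with the standard library, assuming Pearson is the coefficient in use (the text does not say which one was plotted). The feature values are invented for illustration and are not taken from our data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(features):
    """Nested dict of pairwise correlations for a {name: values} mapping."""
    names = list(features)
    return {a: {b: pearson(features[a], features[b]) for b in names} for a in names}


# Invented values for four tracks.
features = {
    "energy": [0.2, 0.4, 0.6, 0.8],
    "loudness": [-20.0, -14.0, -9.0, -4.0],
    "acousticness": [0.9, 0.6, 0.4, 0.1],
}

corr = correlation_matrix(features)
print(round(corr["energy"]["loudness"], 2))      # strongly positive
print(round(corr["energy"]["acousticness"], 2))  # strongly negative
```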

References

[1] https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec
