We used playlist and track data from Spotify. Spotify assigns playlists to categories, which we use as proxies for music genre. Spotify also provides detailed audio-feature metrics that describe each track quantitatively. Each song and playlist has a unique ID, which makes merging the datasets straightforward. Together, the Spotify data offer a well-structured basis for our research goal. For music rankings, we collected Billboard weekly charts, including the overall rank and ranks by music genre, which supports more detailed analysis of trends. With the datasets obtained from Spotify and Billboard, we can conduct either descriptive or predictive analysis. A few topics we are interested in include:
• Are there potential subcategories for a given category?
• How do popular songs’ audio features change over the year?
• Which classification techniques yield better results using sound features?
• How do artists connect on a fundamental level?
• Do song title themes vary across categories?
To collect data from Spotify, we used the ‘spotipy’ package from GitHub. Spotify now requires OAuth 2.0 authentication for every request, and the package simplifies our workflow for extracting data. To avoid privacy issues, we only obtained information in the public scope. First, we identified the category names listed on Spotify. Then we used the category names as search keywords to retrieve related playlists. Next, using the track IDs contained in the playlists, we acquired information on the tracks and their audio features. Finally, we merged the track information with the audio features and added a new variable indicating the category each song belonged to.
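The final merge step can be sketched in plain Python. This is a minimal illustration, not the code we ran: the function name `merge_tracks_with_features`, the record layout, and the sample values are assumptions made for the example. It joins track records to audio-feature records on the track ID, tolerates tracks with no feature record (represented here as `None`), and adds the category variable.

```python
from typing import Dict, List, Optional


def merge_tracks_with_features(
    tracks: List[dict],
    features: List[Optional[dict]],
    category: str,
) -> List[dict]:
    """Join track metadata and audio features on the track id and tag the category."""
    # Index the audio features by track id, skipping missing records (None).
    features_by_id: Dict[str, dict] = {f["id"]: f for f in features if f is not None}

    merged = []
    for track in tracks:
        row = dict(track)  # copy so the input record is not mutated
        # Left join: a track without features keeps only its metadata.
        row.update(features_by_id.get(track["id"], {}))
        row["category"] = category  # new variable recording the source category
        merged.append(row)
    return merged


# Hypothetical sample records (made up for illustration, not real data).
sample_tracks = [
    {"id": "t1", "name": "Song A"},
    {"id": "t2", "name": "Song B"},
]
sample_features = [{"id": "t1", "danceability": 0.8, "energy": 0.6}, None]

merged_rows = merge_tracks_with_features(sample_tracks, sample_features, "pop")
print(merged_rows)
```

A left join keeps tracks that lack audio features, so the missing values remain visible to any later quality check instead of silently disappearing.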
To collect data from Billboard, we used the ‘billboardpy’ package from GitHub. The package wraps the web-scraping process in functions that take search parameters. We picked two overall rankings, “Hot 100” and “Greatest Hot Singles 100”, plus one ranking in each of Billboard’s nine categories, and acquired the rankings for the past year (the most recent 51 weeks from Sept. 30, 2017). Finally, we merged all the rankings and added a new variable indicating the genre.
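The date range and the merge can be sketched as follows. This is an illustrative sketch only: it assumes the 51 weeks are counted backward from Sept. 30, 2017 in 7-day steps, and the helper names `weekly_dates` and `merge_rankings`, the row layout, and the sample rows are hypothetical. The scraping itself is left out.

```python
from datetime import date, timedelta
from typing import Dict, List


def weekly_dates(end: date, n_weeks: int) -> List[date]:
    """Return n_weeks dates spaced 7 days apart, newest first, starting at end."""
    return [end - timedelta(weeks=i) for i in range(n_weeks)]


def merge_rankings(charts_by_genre: Dict[str, List[dict]]) -> List[dict]:
    """Stack per-genre chart rows into one list, adding a 'genre' variable."""
    merged = []
    for genre, rows in charts_by_genre.items():
        for row in rows:
            merged.append({**row, "genre": genre})
    return merged


# Assumption: the window runs backward from Sept. 30, 2017.
chart_dates = weekly_dates(date(2017, 9, 30), 51)

# Hypothetical rows (made up); in practice they would come from the scraper.
sample_charts = {
    "overall": [{"date": chart_dates[0], "rank": 1, "title": "Song A"}],
    "country": [{"date": chart_dates[0], "rank": 1, "title": "Song B"}],
}
all_rankings = merge_rankings(sample_charts)
print(len(chart_dates), all_rankings[0]["genre"])
```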
To develop a criterion for the cleanliness of the data, we used a score built from the percentages of errors (missing values, bad values, wrong data types), with each percentage expressed as a fraction between 0 and 1. Duplicates and irrelevant variables were deleted beforehand. The quality score ranges from 0 to 100. For the Spotify data, the quality score is (1 - missing value percentage)*50 + (1 - out-of-range percentage)*50. For the Billboard data, it is (1 - missing value percentage)*45 + (1 - out-of-range percentage)*45 + (1 - logical error percentage)*10. That is, 100 is the highest score and indicates that the variable is very clean; the lower the score, the “dirtier” the variable. All of our variables scored higher than 80, suggesting that the datasets are clean in a broad sense. While all variables have missing values, only a few contain out-of-range values, and only in a small proportion. Between the two datasets, the Billboard dataset is cleaner, with a higher median score, since it contains fewer variables, has a simpler structure, and its numerical variables are integer based. One other observation is that the scores for the Spotify variables fall roughly into two clusters. We speculate that the values for the audio feature variables were all recorded at once, meaning that a missing value in one would result in missing values in all the others.
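The two scoring formulas translate directly into code. The sketch below assumes that each error rate is passed as a fraction between 0 and 1 and that a missing value is represented as `None`; the function names and the sample column are hypothetical.

```python
from typing import List, Optional


def fraction_missing(values: List[Optional[float]]) -> float:
    """Share of entries that are missing (None); 0.0 for an empty column."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def spotify_quality_score(missing: float, out_of_range: float) -> float:
    """(1 - missing)*50 + (1 - out_of_range)*50, on a 0-100 scale."""
    return (1 - missing) * 50 + (1 - out_of_range) * 50


def billboard_quality_score(
    missing: float, out_of_range: float, logical_error: float
) -> float:
    """(1 - missing)*45 + (1 - out_of_range)*45 + (1 - logical_error)*10."""
    return (1 - missing) * 45 + (1 - out_of_range) * 45 + (1 - logical_error) * 10


# Hypothetical column with one missing value out of four and none out of range.
sample_column = [0.5, None, 0.7, 0.9]
example_score = spotify_quality_score(fraction_missing(sample_column), 0.0)
print(example_score)
```

With a quarter of the values missing and none out of range, the Spotify formula gives 0.75*50 + 1*50 = 87.5.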
By removing blank and duplicate rows, replacing missing cells with the mean or median, and dropping the missing cells in string-type columns, we obtained two clean datasets for our first round of analysis. All the quality scores rose to 100 for the Spotify dataset, and the scores for the Billboard dataset rose to more than 98. Overall, the cleaning procedures were effective and necessary in the pre-analysis phase.
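The cleaning steps can be sketched as one small function. This is a hedged illustration rather than the procedure we actually ran. It assumes that rows are dictionaries and that missing cells are `None`. It also assumes that a missing string cell means the whole row is dropped and that imputation applies only to numeric columns. It uses the median where the mean would be an equally valid choice. The function name `clean_rows` and the sample rows are hypothetical.

```python
from statistics import median
from typing import List


def clean_rows(rows: List[dict], numeric_cols: List[str], string_cols: List[str]) -> List[dict]:
    """Drop blank/duplicate rows, drop rows missing a string value, impute numerics."""
    columns = numeric_cols + string_cols

    # 1. Remove blank rows (every cell missing).
    non_blank = [r for r in rows if any(r.get(c) is not None for c in columns)]

    # 2. Remove duplicate rows, keeping the first occurrence.
    seen = set()
    unique = []
    for r in non_blank:
        key = tuple(r.get(c) for c in columns)
        if key not in seen:
            seen.add(key)
            unique.append(dict(r))  # copy so the input is not mutated

    # 3. Drop rows with a missing value in any string column (assumption).
    kept = [r for r in unique if all(r.get(c) is not None for c in string_cols)]

    # 4. Replace missing numeric cells with the column median.
    for c in numeric_cols:
        observed = [r[c] for r in kept if r.get(c) is not None]
        if not observed:
            continue  # nothing to impute from
        fill = median(observed)
        for r in kept:
            if r.get(c) is None:
                r[c] = fill
    return kept


# Hypothetical rows (made up for illustration).
sample_rows = [
    {"title": "Song A", "tempo": 120.0},
    {"title": "Song A", "tempo": 120.0},  # duplicate
    {"title": None, "tempo": None},       # blank row
    {"title": "Song B", "tempo": None},   # missing numeric cell
    {"title": None, "tempo": 90.0},       # missing string cell
    {"title": "Song C", "tempo": 100.0},
]
cleaned = clean_rows(sample_rows, numeric_cols=["tempo"], string_cols=["title"])
print(cleaned)
```

Dropping rows before imputing means the fill value is computed only from rows that survive the cleaning, so discarded records do not influence it.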