We used publicly available online datasets rather than scraping the relevant information ourselves. As a result, the datasets differ somewhat in type and format. For example, we use a yearly Billboard Hot 100 dataset with lyrics and a weekly Billboard Hot 100 dataset without lyrics, a Spotify songs dataset with lyrics that only extends to 2020, a gender dataset that itself references another GitHub repository for its raw data, and a general demographics dataset that only covers 2000 to the present.
The Billboard Hot 100 and Spotify approximate trends in music from different angles, which helps account for potential source bias. The former captures commercially successful trends by selecting the most popular songs, while the latter takes a broader view of music. Combined, they can corroborate each other’s findings and reduce the chance that a perceived phenomenon is solely an artifact of data collection, as seen in the spike in mentions of smoking in the 1990s that appears in both.
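One way to make this kind of corroboration concrete is to correlate the yearly rate of a topic in the two sources over the years they share. The sketch below is illustrative only: the function name, the input shape (a dict mapping year to rate), and the choice of Pearson correlation are assumptions, not a description of the analysis actually run for this project.

```python
import math

def pearson_on_shared_years(series_a, series_b):
    """Pearson correlation of two {year: rate} series over their shared years.

    Returns None when fewer than two years overlap or when either series
    is constant on the overlap (the correlation is undefined in that case).
    """
    years = sorted(set(series_a) & set(series_b))
    if len(years) < 2:
        return None
    xs = [series_a[y] for y in years]
    ys = [series_b[y] for y in years]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)
```

A high correlation on the shared years would suggest that a spike is not specific to one source, although it cannot rule out a bias that both sources share.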
The gender dataset we used for this project is incomplete: gender information is missing for some artists, which leaves gaps in any analysis that relies on it. It is also almost entirely binary, with only 31 artists listed as nonbinary out of over 7,000. Missing gender entries are far more common than nonbinary ones, at 248, and the dataset does not explain them. Because we relied on an external dataset for practicality, we have limited information on how these missing entries arose and what measures, if any, were taken to limit their impact, so the actual number of nonbinary artists is uncertain.
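Counts like these can be obtained with a simple tally that treats blank or absent labels as their own category instead of silently dropping them. The following is a minimal sketch under assumed inputs: the function name and the idea that labels arrive as a flat list of strings (or `None`) are hypothetical, and the real dataset's column layout may differ.

```python
from collections import Counter

def gender_breakdown(artist_genders):
    """Tally gender labels, counting None or blank strings as 'missing'.

    Returns (counts, shares), where shares are fractions of all artists.
    """
    counts = Counter()
    for g in artist_genders:
        if isinstance(g, str) and g.strip():
            label = g.strip().lower()
        else:
            label = "missing"  # keep missing entries visible in the tally
        counts[label] += 1
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    return counts, shares

# Toy labels for illustration only, not the project's data.
counts, shares = gender_breakdown(["male", "Female", None, "", "nonbinary", "male"])
```

Reporting the missing share alongside each category makes it clear how much any gender-based comparison could shift if the missing entries were filled in.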
Our choice of datasets is heavily influenced by what is available online. The Billboard Hot 100 gives a reasonably large selection of popular songs with complete coverage within its time span, but it only captures the top hundred songs at any given time and only reflects relative popularity within a given week. This could, for example, overstate the influence of music released during relative lulls, or miss or underrepresent major movements that lacked the mainstream popularity to consistently reach the chart. Spotify, on the other hand, has a minimal barrier to entry, which leads to a wide variety of songs of all types in more recent years, but legal complications over streaming licensing may bias it in difficult-to-predict ways, and the number of songs trails off in earlier decades.
Our topic analysis of lyrics and our trend analysis of specific vocabulary also have potential issues. Our methodology counts instances of word use, which is already imprecise for words with multiple meanings and is further obscured by figurative use, imagery, and connotation. Analyzing these in any meaningful depth and with consistency would be impractical across datasets of this size. When tracking sets of words, we selected a broad range to mitigate this kind of error, but we had to assume that these cases would approximately average out and that notable trends or correlations would remain visible despite them.
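The counting approach described above can be sketched roughly as follows. This is a simplified illustration, not the project's actual code: the word set, the tokenizer, the per-1,000-words normalization, and the `(year, lyrics)` input shape are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical word set; the vocabulary actually tracked is not listed here.
SMOKING_WORDS = {"smoke", "smoking", "cigarette", "cigarettes"}

def tokenize(lyrics):
    """Lowercase the lyrics and split them into alphabetic tokens."""
    return re.findall(r"[a-z']+", lyrics.lower())

def mentions_per_1000_words(songs, word_set):
    """Return {year: mentions per 1,000 lyric words} for a word set.

    `songs` is an iterable of (year, lyrics) pairs. Every literal match
    counts equally, so figurative uses are counted like literal ones.
    """
    hits = Counter()
    totals = Counter()
    for year, lyrics in songs:
        tokens = tokenize(lyrics)
        totals[year] += len(tokens)
        hits[year] += sum(1 for t in tokens if t in word_set)
    return {y: 1000 * hits[y] / totals[y] for y in sorted(totals) if totals[y]}
```

Because the match is purely lexical, a line such as "smoke and mirrors" would count toward smoking, which is exactly the kind of error that the broad word sets and the averaging assumption are meant to absorb.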
Another issue is that some of our datasets are limited to specific time frames. As a result, parts of the intersectionality and topic analysis sections of this project only cover the period from 2000 to a few years before the present, which may be too short a window to capture relevant phenomena.
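The usable window for any analysis that combines several datasets is the intersection of their year ranges. A minimal sketch of that calculation is shown below; the function name and the example coverage values are hypothetical, apart from the Spotify end year (2020) and the demographics start year (2000) mentioned earlier.

```python
def common_year_range(coverages):
    """Return the (start, end) years covered by every dataset, or None.

    `coverages` maps a dataset name to its inclusive (first_year, last_year).
    """
    if not coverages:
        return None
    start = max(first for first, _ in coverages.values())
    end = min(last for _, last in coverages.values())
    return (start, end) if start <= end else None

# Illustrative values; only the 2020 and 2000 bounds come from the text above.
coverages = {
    "billboard": (1960, 2023),     # assumed range
    "spotify": (1950, 2020),       # start year assumed
    "demographics": (2000, 2024),  # end year assumed
}
window = common_year_range(coverages)
```

With these example values, the shared window is bounded by the demographics start year and the Spotify end year, which matches the restricted range described above.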