The Million Song Dataset contains music-related information about one million popular contemporary songs and their artists. It includes information about each song, such as tempo (BPM), key, and duration, as well as information about the artist, such as genre, name, and location. The project was a collaboration between the company The Echo Nest and LabROSA, a laboratory at Columbia University, to compile a large body of music data and promote research on algorithms for intelligent machine listening and music information retrieval. The U.S. National Science Foundation (NSF) provided funding that enabled the creation of a dataset at commercial scale for future studies and research.
The dataset covers a wide range of attributes describing artists and songs, including basic characteristics such as artist name, familiarity, and genre. It also contains musical features such as duration, tempo, and time signature. This range opens many avenues of analysis, because individuals exploring the dataset can approach it from several angles depending on their area of interest. For example, a researcher who wanted to investigate correlations between specific song or artist characteristics and popularity or other metrics could do so, because both kinds of information are recorded for the same songs. Furthermore, the dataset provides insight into how musical characteristics have changed over time, as well as information from the listeners' perspective, since favorite genres and artists can be inferred from the popularity measures. Because the scope of this library is so vast, in both the quantity of information and the variety of data points, researchers can reduce the effects of sampling bias and produce more applicable and generalizable results.
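The kind of correlation analysis described above can be sketched in a few lines of Python. This is only a minimal illustration: the rows are invented, the column names (`tempo`, `duration`, `song_hotttnesss`) are assumed to resemble the dataset's field names, and `pearson` and `correlate` are hypothetical helpers written for this example rather than part of any dataset tooling.

```python
from math import sqrt

# Invented sample rows. The column names are assumptions modeled on the
# dataset's fields; the values are NOT real data.
songs = [
    {"tempo": 92.0, "duration": 218.9, "song_hotttnesss": 0.60},
    {"tempo": 121.3, "duration": 148.0, "song_hotttnesss": 0.48},
    {"tempo": 100.1, "duration": 177.5, "song_hotttnesss": None},  # missing value
    {"tempo": 140.0, "duration": 305.2, "song_hotttnesss": 0.73},
    {"tempo": 87.4, "duration": 252.6, "song_hotttnesss": 0.35},
]


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlate(rows, feature, target="song_hotttnesss"):
    """Correlate one feature with a popularity column, skipping missing values."""
    pairs = [
        (row[feature], row[target])
        for row in rows
        if row.get(feature) is not None and row.get(target) is not None
    ]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


for feature in ("tempo", "duration"):
    print(feature, round(correlate(songs, feature), 3))
```

A coefficient near 0 would suggest no linear relationship between the feature and popularity, while values near 1 or -1 would suggest a strong one. With real data, a correlation like this would be a starting point for further investigation, not evidence of cause.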
One important aspect that this dataset highlights is how popular music genres have changed over recent decades. Different eras of music emerged over time as new artists revolutionized production to create new sounds. The birth of new genres speaks to the societal conditions of the time, with the 1960s showing the rise of folk music and of influential artists such as The Beatles. Music shed light on the social changes of the period and reflected the sentiment of people voicing their opinions on national issues such as racial injustice and global affairs such as the Vietnam War. This pattern of music acting as a political, social, and economic reflection of its time can be seen across the decades this dataset spans, and it enables the investigation of questions outside the realm of music. Key moments in the music world can also be examined, such as the appearance of boy bands and teen pop superstars in the early 2000s. In this way, the dataset encompasses the evolution of music through time.
The Evolution of Music Through Time as Sung by the Acapella Group: Pentatonix
Although this library is comprehensive regarding individual songs, there is no data related to albums. Assuming that many of the songs in the dataset were part of an album, it would be helpful to have data on those albums to understand how consumer preferences, and society, changed over time. Such an analysis could use data on the popularity of specific albums, or on whether and how album structure changed over time. In addition, the dataset does not capture the environments in which songs are frequently played, which would be interesting from a sociocultural perspective. For example, genre preferences may differ across racial or ethnic groups, which could have implications for how society divides into distinct communities. The role of music labels and production companies in developing these songs is also missing from the dataset; record labels such as Sub Pop and Death Row Records had undeniable roles in ushering genres such as alt-rock and hip-hop, respectively, into the public consciousness. However, the dataset does thoroughly describe the artist information and music features it sets out to cover.
Compiling such large quantities of data can be challenging. Standardizing metrics such as hotness and familiarity on a scale from 0 to 1, and including confidence values for other statistics, shows recognition that the assessment of so many different types of songs may not be entirely accurate. This acknowledgment frees researchers to interpret the data for themselves, rather than presenting them with a dataset whose biases are designed to prejudice readers. The project does not appear to contain implicit messages and reads as an objective library meant to inform and educate people about various aspects of music.
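One way a researcher might act on those confidence values is to keep only the estimates the dataset itself rates as reliable. The sketch below assumes each estimated field is paired with a `<field>_confidence` column on the same 0 to 1 scale (for example `key` and `key_confidence`). The rows are invented, and both `confident_rows` and the 0.5 threshold are hypothetical choices made for illustration.

```python
# Invented rows. The "<field>_confidence" naming is an assumption about how
# the confidence values are stored alongside each estimate.
tracks = [
    {"title": "Track A", "key": 0, "key_confidence": 0.91},
    {"title": "Track B", "key": 7, "key_confidence": 0.12},
    {"title": "Track C", "key": 5, "key_confidence": 0.55},
    {"title": "Track D", "key": 2},  # confidence missing entirely
]


def confident_rows(rows, field, min_confidence=0.5):
    """Keep rows whose confidence for `field` meets the threshold.

    Rows with no recorded confidence are treated as confidence 0.0 and dropped.
    """
    conf_field = f"{field}_confidence"
    return [row for row in rows if row.get(conf_field, 0.0) >= min_confidence]


kept = confident_rows(tracks, "key")
print([row["title"] for row in kept])  # prints ['Track A', 'Track C']
```

The threshold is a judgment call: a stricter cutoff discards more songs but leaves a cleaner sample, which is the kind of interpretive freedom described above.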
Regarding the ontology of the dataset, some information was extracted from external sources such as musicbrainz.org and 7digital.com. The Echo Nest must have combined data from multiple sources to generate such a vast dataset. The dataset is organized so that fields related to the artist appear earlier in the library, while characteristics of the song itself are listed toward the bottom. This structure makes logical sense because similar types of information are grouped together, making it easier for users to navigate and locate the specific data they want.
Furthermore, the dataset's ontology makes it suitable for applying theoretical lenses to specific research questions. For example, to better understand music trends from a critical feminist perspective, the names of female artists or groups can be selected and investigated to determine attributes such as common genres or periods of popularity. A similar approach can be applied to research from a critical race perspective, where individuals interested in artists of different ethnicities can examine whether any patterns correlate with the socio-political climate of the time.
Article Discussing the Usage of Pop Music as Critical Text