Competition (data and evaluation) is now available at Codabench
The dataset was collected using web-scraping techniques together with APIs designed to retrieve song lyrics, such as the one provided by Genius. The extraction process followed a structured approach: a crawler navigated AZLyrics, starting from the main page, then moving to the artist's page, followed by the album page, and finally reaching the song page, where the complete lyrics were obtained.
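A minimal sketch of this crawl is shown below. It assumes the AZLyrics page structure at collection time; the CSS selectors, the unclassed-`<div>` convention for lyrics, and the User-Agent string are illustrative assumptions, not the authors' exact implementation.

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://www.azlyrics.com"  # assumed entry point

def get_soup(url: str) -> BeautifulSoup:
    """Fetch and parse a page; a real crawler would add retries and rate limiting."""
    resp = requests.get(url, headers={"User-Agent": "lyrics-dataset-crawler"}, timeout=10)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")

def scrape_artist(artist_path: str) -> list[dict]:
    """Follow the artist page -> album listing -> song page chain and collect lyrics."""
    songs = []
    artist_page = get_soup(BASE + artist_path)
    # hypothetical selector: song links grouped under each album on the artist page
    for link in artist_page.select("div.listalbum-item a[href]"):
        song_url = requests.compat.urljoin(BASE + artist_path, link["href"])
        song_page = get_soup(song_url)
        # hypothetical selector: the lyrics rendered in an unclassed <div>
        lyrics_div = song_page.find("div", class_=False, id=False)
        if lyrics_div is not None:
            songs.append({
                "title": link.get_text(strip=True),
                "url": song_url,
                "lyrics": lyrics_div.get_text("\n", strip=True),
            })
    return songs
```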
To ensure accuracy in lyrics transcription, rhythm consistency, and correct metadata, the Genius API was used as a validation layer. This allowed for cross-checking the retrieved lyrics, confirming song metadata such as release date and album, and ensuring that the dataset maintained high-quality standards.
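A hedged sketch of such a validation layer against the public Genius search endpoint follows. The access token is a placeholder, and the exact-name matching heuristic is an assumption rather than the authors' method; the response fields follow the documented Genius API.

```python
import requests

GENIUS_API = "https://api.genius.com"
TOKEN = "YOUR_GENIUS_ACCESS_TOKEN"  # placeholder credential

def validate_song(title: str, artist: str) -> dict | None:
    """Search Genius for a song and return its metadata if a matching hit is found."""
    resp = requests.get(
        f"{GENIUS_API}/search",
        params={"q": f"{title} {artist}"},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    for hit in resp.json()["response"]["hits"]:
        result = hit["result"]
        # crude matching heuristic (an assumption, not the authors' method)
        if result["primary_artist"]["name"].lower() == artist.lower():
            return {
                "title": result["title"],
                "release_date": result.get("release_date_for_display"),
                "url": result["url"],
            }
    return None  # no confident match: flag the scraped entry for manual review
```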
All lyrics are written in Spanish and span a variety of rhythms and genres. To support accurate annotation, the filtering process accounted for slang, general terminology, and metaphorical expressions related to misogyny.
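One simple way to implement such a filter is a lexicon match over accent-normalized lyrics, as in the sketch below. The lexicon entries are illustrative placeholders, not the actual filtering vocabulary.

```python
import re
import unicodedata

# illustrative placeholders; the real filter covered slang, general terminology,
# and metaphorical expressions (multiword entries would need phrase matching)
LEXICON = {"placeholder_term_1", "placeholder_term_2"}

def normalize(text: str) -> str:
    """Lowercase and strip diacritics so Spanish spelling variants match consistently."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def matches_lexicon(lyrics: str) -> bool:
    """True if any lexicon entry appears as a whole token in the lyrics."""
    tokens = set(re.findall(r"\w+", normalize(lyrics)))
    return not LEXICON.isdisjoint(tokens)
```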
To ensure a balanced and comprehensive evaluation of the MiSonGyny dataset, we've established an annotation protocol involving three distinct participant groups: Men (M), Women (W), and members of the Feminist Community (FC). Fifteen individuals (five from each group) will be divided into five subgroups, each containing one member from M, one from W, and one from FC. This structure is designed to mitigate bias by incorporating diverse social perspectives into the classification process.
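This subgroup assignment can be expressed as a shuffled round-robin pairing, as in the sketch below; the annotator identifiers are placeholders.

```python
import random

# five annotators per group; identifiers are placeholders
groups = {
    "M":  [f"M{i}" for i in range(1, 6)],
    "W":  [f"W{i}" for i in range(1, 6)],
    "FC": [f"FC{i}" for i in range(1, 6)],
}

# shuffle within each group, then zip so every subgroup gets one M, one W, one FC
for members in groups.values():
    random.shuffle(members)

subgroups = list(zip(groups["M"], groups["W"], groups["FC"]))
# e.g. [('M3', 'W1', 'FC4'), ...] -- five subgroups of three annotators each
```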
With data collection for the Misongyny dataset now complete, we've entered the annotation phase. Each subgroup will review an equal portion of the approximately 2,500 song lyrics. Annotators will perform two key tasks. First, they will determine whether the lyrics contain misogynistic content (binary classification: misogynistic or not). Second, for lyrics identified as misogynistic, they will specify the type of misogyny, choosing one of four categories: Sexualization (S), Violence (V), Hate (H), or Not Related (NR). Additionally, annotators will tag the lyrics by verse, identifying the specific phrases or words where misogyny is detected. This verse-level tagging anticipates a future Named Entity Recognition (NER) subtask. For this initial annotation phase, this two-step approach ensures both a nuanced understanding of the presence of misogyny and a clear categorization of its form. This collaborative and well-structured methodology aims to capture a broad range of viewpoints, ultimately strengthening the reliability and validity of the Misongyny dataset annotations.
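A possible record layout for one annotated song, reflecting the two-step labels and the verse-level spans, is sketched below; the field names and values are assumptions, not the released schema.

```python
# hypothetical annotation record for a single song
annotation = {
    "song_id": "0001",
    "misogynistic": True,            # step 1: binary label (M / NM)
    "category": "V",                 # step 2: S, V, H, or NR (only if misogynistic)
    "verse_tags": [                  # verse-level spans for the future NER subtask
        {"verse": 3, "span": "placeholder offending phrase"},
    ],
    "annotators": ["M2", "W5", "FC1"],  # one subgroup: one M, one W, one FC
}
```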
Subtask 1 (binary classification):
Number of instances: ~2,500
Categories: Not Misogynist (NM) and Misogynist (M)
Partitions (train/validation/test): 75/5/20
Subtask 2 (misogyny type classification):
Number of instances: ~2,500
Categories: Sexualization (S), Violence (V), Hate (H), Not Related (NR)
Partitions (train/validation/test): 75/5/20
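A hedged sketch of reproducing the 75/5/20 split with scikit-learn is given below; it assumes the data sits in a pandas DataFrame with a label column, and the column name and seed are placeholders.

```python
from sklearn.model_selection import train_test_split

def split_dataset(df, label_col="misogynistic", seed=42):
    """Split into 75% train, 5% validation, 20% test, stratified by label."""
    train, rest = train_test_split(
        df, test_size=0.25, stratify=df[label_col], random_state=seed
    )
    # of the remaining 25%, 5 points go to validation and 20 to test (0.20 / 0.25 = 0.8)
    val, test = train_test_split(
        rest, test_size=0.8, stratify=rest[label_col], random_state=seed
    )
    return train, val, test
```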