
The dataset used for the two original tasks was collected over the past decade via the Twitter API and is described in the overview paper of Homo-MEX 2023. In contrast, the dataset for task 3 was obtained by web scraping song lyrics. All the lyrics are written in Spanish and span a variety of rhythms and genres. An effort was made to include genres and artists considered homophobic in their songs as well as those supporting the LGBTQ+ community. The datasets were manually labeled to mark instances of LGBT+phobia. A summary of the corpus for each track of the proposed shared tasks is given below:

Competition (data and evaluation) available at Codabench

Track 1: Hate speech

Number of instances: ~12,000

Categories: LGBT+phobic (P), Not-LGBT+phobic (NP), and not LGBT+related (NR)

Partitions (train/test): 80/20

Track 2: Fine-grained hate speech

Number of instances: ~5,000

Categories: Gayphobia (G), Lesbophobia (L), Transphobia (T), Biphobia (B), Aphobia (A) and not LGBT+related (NR)

Partitions (train/test): 80/20

Track 3: Homophobic Lyrics

Number of instances: ~1,200

Categories: LGBT+phobic (P) vs Not-LGBT+phobic (NP)

Partitions (train/test): 80/20
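
The description above does not say how the 80/20 partitions were drawn. For participants who want to carve out their own held-out set in the same proportion, the following is a minimal sketch of a stratified split that keeps the label ratios roughly equal on both sides. The function name `stratified_split`, the seed, and the toy label counts are illustrative assumptions, not part of the official data release.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices), splitting each label group separately."""
    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy labels using the Track 1 category codes (P, NP, NR); the counts are made up.
toy_labels = ["P"] * 10 + ["NP"] * 20 + ["NR"] * 20
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

Splitting per label rather than over the whole corpus matters most for the smaller classes, which a plain random split could leave underrepresented in the held-out portion.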