Dataset

A new dataset, CONAN-MT-SP, has been created for this task. The corpus consists of hate-speech/counter-narrative (HS-CN) pairs covering 8 hate targets (disabled, Jews, LGBT+, migrants, Muslims, people of colour, women and other groups).

Figure 1: Overview of the process used to create CONAN-MT-SP.

As Figure 1 shows, CONAN-MT-SP is built from the hate speech in the English Multi-Target CONAN (CONAN-MT) corpus (Fanton et al. 2021). CONAN-MT collected its initial HS-CN pairs through nichesourcing with two NGOs and then used these pairs to generate further HS-CN pairs with GPT-2, with human review integrated into the process. Because the hate-speech messages in CONAN-MT are in English, we translated them into Spanish using the DeepL API. Our annotators reviewed all translations and corrected those that were erroneous. The counter-narrative (CN) associated with each hate-speech message (HS) was generated by GPT-4 using a few-shot prompting strategy: the model was given a task description and 8 example HS-CN pairs, one per target. The counter-narratives generated by GPT-4 were then evaluated by human experts using several metrics.
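The few-shot prompting step described above can be sketched roughly as follows. This is a minimal illustration and not the organizers' actual prompt: the `Example` class, the `build_few_shot_prompt` function, the `HS:`/`CN:` layout, and the placeholder texts are all assumptions, and no model is called. It only shows how a task description and one demonstration per target could be assembled into a single prompt.

```python
from dataclasses import dataclass

# The eight hate targets listed for CONAN-MT-SP.
TARGETS = [
    "disabled", "Jews", "LGBT+", "migrants",
    "Muslims", "people of colour", "women", "other",
]


@dataclass
class Example:
    """One demonstration pair (hypothetical structure)."""
    target: str
    hate_speech: str
    counter_narrative: str


def build_few_shot_prompt(task_description: str,
                          examples: list[Example],
                          new_hate_speech: str) -> str:
    """Assemble task description, one demonstration per target, then the query."""
    covered = {ex.target for ex in examples}
    missing = [t for t in TARGETS if t not in covered]
    if missing:
        raise ValueError(f"no example for targets: {missing}")

    parts = [task_description.strip(), ""]
    for ex in examples:
        parts.append(f"HS: {ex.hate_speech}")
        parts.append(f"CN: {ex.counter_narrative}")
        parts.append("")
    # The prompt ends with an open "CN:" slot for the model to complete.
    parts.append(f"HS: {new_hate_speech}")
    parts.append("CN:")
    return "\n".join(parts)


# Placeholder texts stand in for real examples.
examples = [
    Example(t, f"<HS example about {t}>", f"<CN example about {t}>")
    for t in TARGETS
]
prompt = build_few_shot_prompt(
    "Write a counter-narrative (CN) in Spanish for the hate-speech message (HS).",
    examples,
    "<new HS message>",
)
```

The resulting string would then be sent to the generation model; how that call is made, and the exact wording of the task description, are not specified in this section.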



For RefutES, we selected from this corpus only the “perfect” counter-narratives, i.e., those that are non-offensive, completely disagree with the hate-speech message, are specific and informative, are compellingly truthful, need no editing, and are equal to or better than the original CONAN-MT counter-narrative. The corpus is divided into three subsets, each related to a different part of the competition.
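The selection of “perfect” counter-narratives amounts to a conjunction of the criteria listed above. The sketch below illustrates that logic only; the `Evaluation` class, its field names, and the label values (`"disagree"`, `"better"`, `"equal"`) are hypothetical and do not reflect the actual annotation scheme or the column names of the released files.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Human evaluation of one generated CN (all fields are hypothetical)."""
    offensive: bool        # True if the CN contains offensive language
    stance: str            # e.g. "disagree", "partially_disagree", "agree"
    informative: bool      # specific and informative
    truthful: bool         # compellingly truthful
    needs_editing: bool    # True if a human would have to edit the CN
    vs_original: str       # "better", "equal" or "worse" than the CONAN-MT CN


def is_perfect(ev: Evaluation) -> bool:
    """Return True only if every selection criterion is satisfied."""
    return (
        not ev.offensive
        and ev.stance == "disagree"
        and ev.informative
        and ev.truthful
        and not ev.needs_editing
        and ev.vs_original in {"better", "equal"}
    )


# A CN that meets every criterion is kept; one that needs editing is not.
kept = is_perfect(Evaluation(False, "disagree", True, True, False, "equal"))
dropped = is_perfect(Evaluation(False, "disagree", True, True, True, "better"))
```

Because the criteria are combined with a logical AND, a counter-narrative that fails any single criterion is excluded from the selection.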


The features of the RefutES corpus correspond to the columns of the provided CSV files.


The data for this task will be provided to participants according to the established schedule.

The train, dev and test splits are now available at https://github.com/sinai-uja/RefutES.git

Content Warning: The data contains offensive comments that do not represent the opinion of the organizers. The dataset provided must be used exclusively for the completion of the task and must not be redistributed under any circumstances.