IMPORTANT INFORMATION [NEW VERSION OF THE SHARED TASK DATASET]: As of October 30, 2024, we have updated the shared task dataset on Hugging Face. We identified inconsistencies in the <EOS> tokens within the KN column, where some instances were missing an <EOS> token and others contained more than intended. These issues have been corrected, and the updated dataset is now available for download.
In addition to paper contributions, we are organizing a shared task on multilingual counterspeech generation, with the aim of gathering current efforts in a central space, especially those for languages other than English.
We envisage that the shared task will allow the community to study how to improve counterspeech generation for lower-resource languages, while also reinforcing the strong body of research that already exists for English.
The counterspeech generated by participants should be respectful, non-offensive, and contain information that is specific and truthful with respect to the following targets: Jews, LGBT+, immigrants, people of color, women.
We release new data consisting of 596 Hate Speech-Counter Narrative (HS-CN) pairs. In this dataset, the HS are taken from MTCONAN [4], while the CNs are newly generated. Together with each HS-CN pair, we also provide 5 background knowledge sentences, some of which are relevant for producing the Counter Narrative. The dataset is available in 4 languages (Basque, English, Italian and Spanish) and divided into the following splits (a loading sketch follows the list):
Development: 100 pairs. [LINK to DATA]
Train: 396 pairs. [LINK to DATA]
Test: 100 pairs [AVAILABLE NOW!] [LINK to DATA]
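For convenience, the data can be loaded with the Hugging Face datasets library. The sketch below is a minimal example: the repository id and the field names are placeholders inferred from this description, not the actual identifiers, so substitute those given in the data links above.

```python
# Minimal sketch: loading the shared task splits with the `datasets` library.
# NOTE: "ORG/shared-task-data" is a hypothetical repository id, and the field
# names (HS, KN, CN) are assumed from this task description.
from datasets import load_dataset

data = load_dataset("ORG/shared-task-data")  # hypothetical repository id

train = data["train"]       # 396 HS-CN pairs
dev = data["validation"]    # 100 pairs (the split may also be named "dev")
test = data["test"]         # 100 pairs; reference CNs stay hidden at test time

example = train[0]
print(example["HS"])  # hate speech instance
print(example["KN"])  # the 5 background knowledge sentences
print(example["CN"])  # gold counter-narrative
```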
To score the shared task participants, the CNs of the test set will be kept hidden for the duration of the shared task, while the HS and the background knowledge will be released for participants to prepare their submissions.
The languages, Basque, English, Italian and Spanish, offer a varied spectrum of complexity, including an agglutinative language isolate (Basque), two Romance languages (Italian and Spanish) and a Germanic one (English). The choice of languages reflects the linguistic expertise among the organizers required to successfully run the shared task.
Participants also have available the manually curated CONAN data for English, Italian, Basque and Spanish:
CONAN [1] (English and Italian): https://github.com/marcoguerini/CONAN/tree/master/CONAN
CONAN [2] (Basque and Spanish): https://huggingface.co/datasets/HiTZ/CONAN-EUS
CONAN-MT-SP [3] (Spanish): https://github.com/sinai-uja/CONAN-MT-SP or https://huggingface.co/datasets/SINAI/CONAN-MT-SP
Multitarget-CONAN [4] (English): https://github.com/marcoguerini/CONAN/tree/master/Multitarget-CONAN
Content Warning: The data contains offensive comments that do not represent the opinion of the organizers. The dataset provided will be used exclusively for the completion of the task, and will not be distributed under any circumstances.
The aim of the shared task is, given a HS (and optionally any additional knowledge participants may wish to use), to generate a CN that counteracts the HS.
Participants will download the test HS for the 4 languages and generate at most three different CNs per HS for each language. The test window will last 5 days.
Participants are allowed to use any resource (language model, data, etc.) to generate the CN and to participate in any of the languages.
Each team is allowed to submit up to 3 runs. The submission format for each run is as follows:
File Format: Each run should be a CSV file named "TeamName-runX-predictions.csv" where X is the number of the run (1, 2 or 3).
File Content: The CSV file must contain three columns: "ID", "KN", and "KN_CN". Each column should follow these guidelines (a formatting sketch follows the list):
ID: This column must contain a unique identifier for each dataset pair. In the original dataset, this identifier is created by concatenating the PAIR_ID and LANG fields (e.g., "IT001").
KN: This column should include any external and specific knowledge used to generate each PAIR_ID in the dataset. It should not be confused with the KN column of our dataset. If no extra or specific knowledge was used to generate the counter-narrative in the KN_CN column, this column can be left empty (e.g., null or “”).
KN_CN: This column should contain the generated counter-narratives.
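To make the expected format concrete, the following minimal sketch writes a run file with Python's csv module. The team name, IDs and texts are hypothetical; only the file naming pattern and the three-column layout come from the guidelines above.

```python
# Minimal sketch of writing a submission file in the required format.
# "MyTeam" and the example rows are hypothetical; the file name pattern
# "TeamName-runX-predictions.csv" and the columns "ID", "KN", "KN_CN"
# follow the submission guidelines.
import csv

rows = [
    # (ID, KN actually used by the system (may be empty), generated CN)
    ("IT001", "Background sentence retrieved by the system.", "Generated counter-narrative ..."),
    ("EN001", "", "Counter-narrative generated without extra knowledge ..."),
]

with open("MyTeam-run1-predictions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "KN", "KN_CN"])
    writer.writerows(rows)
```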
Note: Only one form per team will be accepted, so it is recommended that a single member of each team fills out the form. [LINK to SUBMISSION FORM]
The CNs submitted by the participants will be evaluated:
Using traditional automatic metrics as in [5], including BLEU, ROUGE, BERTScore and Novelty.
Using LLM-as-a-Judge, following the approach described in [6]. The evaluation scripts can be found in this GitHub repository:
https://github.com/hitz-zentroa/eval-MCG-COLING-2025/
While LLM-as-a-Judge [6] is the main metric used to rank the runs submitted to the shared task, automatic evaluation of multilingual counterspeech generation remains an open research problem. We therefore also provide results with other automatic metrics (specifically, overlap metrics, to account for CNs generated using the additional knowledge provided for the task). Furthermore, at the workshop we plan to discuss in depth the pros and cons of the evaluation offered, as well as any alternative evaluation methods and additional metrics relevant for the task. The metrics are defined below, followed by an illustrative scoring sketch:
JudgeLM [6]: An LLM-based ranking method for the evaluation of automatic counter-narrative generation.
BLEU [7]: It measures token n-gram overlap between the predictions and the references. The scores are averaged over the whole corpus.
ROUGE-L [8]: It computes the overlap between predictions and references based on their longest common subsequence, which naturally captures sentence-level structural similarity.
BERTScore [9]: It computes a similarity score for each token in the prediction against each token in the reference. Predictions and references are encoded with BERT contextual embeddings, and tokens are matched by cosine similarity.
Novelty [10]: It is computed as the proportion of non-singleton n-grams in the generated text that do not appear in the training data.
gen_len: It refers to the average length of the generated predictions.
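The official scoring code is in the repository linked above. Purely as an illustrative sketch, the overlap metrics can be approximated with the Hugging Face evaluate library, and the novelty function below encodes one plausible reading of the definition above; none of this is the official implementation.

```python
# Illustrative sketch only; the official scoring code lives at
# https://github.com/hitz-zentroa/eval-MCG-COLING-2025/ .
# Overlap metrics via the Hugging Face `evaluate` library, plus a simple
# novelty function following the definition given above (assumed reading).
from collections import Counter

import evaluate

predictions = ["Generated counter-narrative ..."]   # hypothetical system outputs
references = [["Reference counter-narrative ..."]]  # gold CNs, one list per prediction

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
bert = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)

def novelty(generated: str, train_texts: list[str], n: int = 3) -> float:
    """Fraction of non-singleton n-grams in `generated` that never occur in the training texts."""
    def ngrams(text: str):
        toks = text.split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    counts = Counter(ngrams(generated))
    non_singletons = [g for g, c in counts.items() if c > 1]
    if not non_singletons:
        return 0.0
    seen = {g for t in train_texts for g in ngrams(t)}
    return sum(g not in seen for g in non_singletons) / len(non_singletons)

gen_len = sum(len(p.split()) for p in predictions) / len(predictions)
print(bleu["bleu"], rouge["rougeL"], sum(bert["f1"]) / len(bert["f1"]), gen_len)
```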
Results per language (English, Basque, Italian, Spanish): [RESULTS]
Participants will be required to submit their runs and to describe their systems in a paper submission. We encourage participating teams to highlight the real contribution of their systems, identifying successful approaches along with failed attempts and findings on how to advance towards more performant solutions. This description must contain the following details:
Architecture: modules, components, data flow…
Additional data used for training (if any): augmented data, additional datasets…
Pre-trained models used (if any): source of the model, selection criteria…
Experiments conducted and training parameters: configuration, hyperparameters used…
Analysis of results: findings from results, ranking according to different metrics, interpretation, and validation…
Error analysis: a study of failed predictions and their characterization, possible improvements, and lessons learned…
Authors can submit papers of up to 8 pages, with unlimited pages for references. Submissions should follow the COLING 2025 Author Guidelines and policies for submission, review and citation, and be anonymised for double-blind reviewing. Please use the COLING 2025 style files; LaTeX style files and Microsoft Word templates are available at https://coling2025.org/calls/submission_guidlines/.
Note: All submissions must be in PDF format and made through START. Moreover, the title of workshop papers must be "TeamName at Multilingual Counterspeech Generation: Title".
Important dates:
Jul 31st
Oct 1st
Oct 21st
Oct 25th
Oct 28th
Nov 4th
Nov 10th
Nov 15th
Nov 20th
Nov 25th
Dec 3rd
Dec 8th
Dec 10th
Dec 13th
Jan 19th
Important: All deadlines are at 23:59 UTC-12 (Anywhere on Earth, AoE Time Zone).
Yi-Ling Chung, Elizaveta Kuzmenko, Serra Sinem Tekiroglu, and Marco Guerini. 2019. CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2819–2829, Florence, Italy. Association for Computational Linguistics.
Jaione Bengoetxea, Yi-Ling Chung, Marco Guerini, and Rodrigo Agerri. 2024. Basque and Spanish Counter Narrative Generation: Data Creation and Evaluation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2132–2141, Torino, Italia. ELRA and ICCL.
María Estrella Vallecillo Rodríguez, Maria Victoria Cantero Romero, Isabel Cabrera De Castro, Arturo Montejo Ráez, and María Teresa Martín Valdivia. 2024. CONAN-MT-SP: A Spanish Corpus for Counternarrative Using GPT Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 3677–3688, Torino, Italia. ELRA and ICCL.
Margherita Fanton, Helena Bonaldi, Serra Sinem Tekiroğlu, and Marco Guerini. 2021. Human-in-the-Loop for Data Collection: a Multi-Target Counter Narrative Dataset to Fight Online Hate Speech. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3226–3240, Online. Association for Computational Linguistics.
Serra Sinem Tekiroğlu, Helena Bonaldi, Margherita Fanton, and Marco Guerini. 2022. Using pre-trained language models for producing counter narratives against hate speech: a comparative study. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3099–3114, Dublin, Ireland. Association for Computational Linguistics.
Irune Zubiaga, Aitor Soroa, and Rodrigo Agerri. 2024. A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation. In Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics. https://aclanthology.org/2024.findings-emnlp.559/
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations (ICLR 2020).
Ke Wang and Xiaojun Wan. 2018. SentiGAN: Generating Sentimental Texts via Mixture Adversarial Networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4446–4452.