The FadeIT shared task focuses on detecting fallacies expressed in Italian social media texts about migration, climate change, and public health, at two granularities: at the post level (coarse-grained fallacy detection) and/or at the span level (fine-grained fallacy detection). Each text can contain zero, one, or more fallacies from an inventory of 20 fallacy types (see the fallacy inventory).
Participants can take part in one or both subtasks and can state their (non-binding) preference during registration.
Subtask A – Coarse-grained fallacy detection. Given the text of a social media post, predict the fallacies expressed in it. This is a multi-label classification task (over 20 fallacy classes) and represents the easiest setup: there is no need to locate fallacies within the text, only to detect which ones (if any) are expressed in it.
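To make the setup concrete, here is a minimal, unofficial sketch of a coarse-grained baseline, assuming each post is paired with a (possibly empty) set of fallacy labels. The example posts are invented placeholders and only fallacy names mentioned on this page are used; the official baselines will be released on the FadeIT repository.

```python
# Minimal multi-label baseline sketch: one binary classifier per fallacy type
# over TF-IDF features. Posts and gold labels below are invented placeholders,
# not actual FadeIT data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

posts = [
    "Either we stop migration now or our country is finished.",
    "A famous actor said vaccines are useless, so they must be.",
    "The new park downtown opened yesterday.",  # no fallacy
]
labels = [["False dilemma"], ["Appeal to authority"], []]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # label sets -> binary indicator matrix (n_posts x n_classes)

clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(posts, Y)

pred = clf.predict(["Either we act today or everything is lost."])
print(mlb.inverse_transform(pred))  # predicted label set, possibly empty
```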
Systems will be evaluated according to micro Precision, Recall, and F1 score, averaged over two equally valid gold-standard annotations of the 20% held-out test set. Systems will be ranked by micro F1 score.
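As an illustration, the following unofficial sketch (with invented toy predictions) computes micro Precision, Recall, and F1 against each of the two gold annotations and then averages the three scores, which is one natural reading of the description above; the official scorer released on the FadeIT repository is authoritative.

```python
# Toy illustration of coarse-grained scoring against two gold annotations.
# All label sets below are invented examples.
from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import MultiLabelBinarizer

gold_a = [["False dilemma"], [], ["Red herring", "Appeal to authority"]]
gold_b = [["False dilemma"], [], ["Red herring"]]  # second, equally valid annotation
pred = [["False dilemma"], ["Red herring"], ["Red herring"]]

mlb = MultiLabelBinarizer().fit(gold_a + gold_b + pred)
scores = []
for gold in (gold_a, gold_b):
    p, r, f1, _ = precision_recall_fscore_support(
        mlb.transform(gold), mlb.transform(pred), average="micro", zero_division=0
    )
    scores.append((p, r, f1))

# average each metric over the two gold standards
avg = [sum(s[i] for s in scores) / 2 for i in range(3)]
print(f"micro P={avg[0]:.3f}  R={avg[1]:.3f}  F1={avg[2]:.3f}")
```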
Subtask B – Fine-grained fallacy detection. Given the text of a social media post (pre-divided into tokens), predict the text spans of the fallacies expressed in it. This is a multi-label sequence labeling task (over 20 fallacy classes) and represents the hardest setup: fallacies must be located within the text, and spans of different fallacies may partially overlap.
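One common way to encode such overlapping annotations, sketched below, is to keep an independent BIO sequence per fallacy class, so that spans of different classes never conflict. This encoding is an assumption for illustration only; the tokens, spans, and labels are invented, and the actual data format is defined in the FadeIT repository.

```python
# One BIO tag sequence per fallacy class allows overlapping spans.
# Tokens, span offsets, and labels below are invented placeholders.
tokens = ["Either", "we", "act", "now", "or", "we", "are", "doomed", "."]
# spans as (start_token, end_token_exclusive, label)
spans = [(0, 8, "False dilemma"), (2, 4, "Appeal to fear")]

classes = {label for _, _, label in spans}
bio = {label: ["O"] * len(tokens) for label in classes}
for start, end, label in spans:
    bio[label][start] = "B"
    for i in range(start + 1, end):
        bio[label][i] = "I"

for label, seq in bio.items():
    print(f"{label:15s} {' '.join(seq)}")
```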
Systems will be evaluated according to metrics designed for span-level annotations with potential overlaps, averaged over two gold-standard annotations of the 20% held-out test set. We adopt the micro Precision, Recall, and F1 score variants proposed by Da San Martino et al. (2019), extended to work at the token level: partial credit is given to partial span matches, proportional to the length of the match in tokens. To account for the severity of labeling errors (e.g., predicting Red herring instead of Appeal to authority is less problematic than predicting False dilemma), results are also computed in a soft evaluation mode, which gives partial credit (0.5 instead of 1.0) when the predicted label is the immediate parent of the gold label in the taxonomy of fallacy types by Ramponi et al. (2025). Systems will be ranked by micro F1 score in the soft evaluation mode.
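The sketch below is an unofficial reconstruction of this kind of scoring, assuming spans are (start_token, end_token_exclusive, label) tuples. Following Da San Martino et al. (2019), each predicted/gold span pair contributes its token overlap, normalized by the predicted span length for precision and by the gold span length for recall; in soft mode an exact label match scores 1.0 and a predicted label that is the immediate parent of the gold label scores 0.5. The PARENT map is a hypothetical stand-in for the Ramponi et al. (2025) taxonomy, and the official scorer on the FadeIT repository is the authoritative reference.

```python
# Unofficial sketch of token-level span scoring with soft label credit.
PARENT = {"Appeal to authority": "Red herring"}  # illustrative stand-in for the taxonomy

def label_credit(pred_label, gold_label, soft=True):
    """1.0 for an exact match, 0.5 if the prediction is the gold label's parent (soft mode)."""
    if pred_label == gold_label:
        return 1.0
    if soft and PARENT.get(gold_label) == pred_label:
        return 0.5
    return 0.0

def overlap(s, t):
    """Token overlap between two (start, end_exclusive, label) spans."""
    return max(0, min(s[1], t[1]) - max(s[0], t[0]))

def span_prf(pred_spans, gold_spans, soft=True):
    # precision normalizes each pair's overlap by the predicted span length,
    # recall by the gold span length (per Da San Martino et al., 2019)
    p_sum = sum(
        label_credit(s[2], t[2], soft) * overlap(s, t) / (s[1] - s[0])
        for s in pred_spans for t in gold_spans
    )
    r_sum = sum(
        label_credit(s[2], t[2], soft) * overlap(s, t) / (t[1] - t[0])
        for t in gold_spans for s in pred_spans
    )
    p = p_sum / len(pred_spans) if pred_spans else 0.0
    r = r_sum / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# A partial span match with the (hypothetical) parent label earns partial credit:
gold = [(2, 8, "Appeal to authority")]
pred = [(2, 6, "Red herring")]
print(span_prf(pred, gold, soft=True))  # (0.5, 0.333..., 0.4)
```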
Baselines for both the coarse-grained and fine-grained fallacy detection subtasks will be provided to participants on the FadeIT repository on GitHub on September 22, 2025, when the train/dev data and the evaluation scorer will also be released (see important dates).
In the evaluation phase, we will accept up to 3 runs per subtask from each participating team. Different runs can reflect, e.g., different solutions or different configurations of the same system. The format of the prediction file to submit is described in the FadeIT repository on GitHub.
For example, if you participate in both subtask A and subtask B, you can submit up to 6 runs in total: up to 3 for subtask A and up to 3 for subtask B.
External resources are allowed, and we encourage you to use them! Participants may use external resources in addition to (or in place of) the data provided by the organizers to train their models. Examples of allowed external resources are pre-trained models, existing fallacy detection datasets, and newly annotated data. Participants are also encouraged to leverage the time and topic metadata associated with the posts when designing their solutions. In case of doubt, you can write to us through the Google Group.