Task 1 Evaluation & Baselines

Evaluation

Systems will be evaluated with ICM (Information Contrast Model; Amigó and Delgado, 2022), which will be used as the official metric to rank systems for the first task on propaganda detection.


For each model, outputs will be evaluated as classification tasks. Note that propaganda categorization is non-trivial from the point of view of evaluation metrics, as it is not a straightforward classification problem: the classes are partially related in a hierarchy. So, for instance, a mistake between group 2 and group 3 is less severe than a mistake between group 2 and group 0. This makes ICM better suited to the problem, as it takes these relations into account.
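To illustrate the idea, below is a minimal sketch of an ICM-style score. It is not the official implementation: it assumes that information content (IC) is estimated from label frequencies in the gold standard, treats labels in a set as independent, and uses a common parametrization (alpha1 = alpha2 = 2, beta = 3) of ICM(A, B) = alpha1*IC(A) + alpha2*IC(B) - beta*IC(A ∪ B). See Amigó and Delgado (2022) for the actual metric.

# Toy ICM-style scorer (NOT the official implementation).
# Assumptions: IC of a label is estimated from its frequency in the gold
# standard, labels are independent, and each fine-grained label is expanded
# with its parent group so that near-misses in the hierarchy still share
# information with the gold label.
import math
from collections import Counter

def expand(labels):
    # Hypothetical hierarchy expansion: "2 discrediting the opponent - doubt"
    # also counts as "2 discrediting the opponent".
    out = set(labels)
    out.update(l.split(" - ")[0] for l in labels)
    return out

def make_ic(gold_sets):
    # Estimate P(label) over the expanded gold annotations.
    n = len(gold_sets)
    counts = Counter(l for s in gold_sets for l in expand(s))
    def ic(label_set):
        # Labels never seen in the gold standard are ignored in this toy version.
        return sum(-math.log2(counts[l] / n) for l in label_set if counts[l])
    return ic

def icm(pred, gold, ic, a1=2.0, a2=2.0, beta=3.0):
    # ICM(A, B) = a1*IC(A) + a2*IC(B) - beta*IC(A ∪ B): positive for good
    # matches, negative for spurious or missing labels.
    p, g = expand(pred), expand(gold)
    return a1 * ic(p) + a2 * ic(g) - beta * ic(p | g)

With this parametrization, a perfect prediction scores the IC of the gold label set, so correct predictions of infrequent labels are rewarded more, while spurious or missing labels push the score toward negative values.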

Baselines 

We will provide baselines based on RoBERTa-large (Liu et al., 2019), MarIA (Gutiérrez-Fandiño et al., 2021) and LLaMA (Touvron et al., 2023). We will also report GPT-4 baselines obtained in a few-shot setting.
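The exact prompting setup for the GPT-4 baseline is not specified here; a few-shot baseline of the kind mentioned might look roughly like the sketch below, where the model name, prompt wording and in-context examples are all illustrative.

from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()

# Illustrative few-shot prompt for the binary task (task 1).
FEW_SHOT = (
    "Decide whether each tweet contains propaganda (true/false).\n\n"
    "Tweet: Our great nation has always stood on the right side of history!\n"
    "Propaganda: true\n\n"
    "Tweet: The delegation arrived in Geneva for the summit on Tuesday.\n"
    "Propaganda: false\n\n"
)

def classify(tweet: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": FEW_SHOT + f"Tweet: {tweet}\nPropaganda:"}],
    )
    return "true" in response.choices[0].message.content.lower()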


Output format

Your system output should use the following format (which is a simplification of the training data format):


[
  {
    "test_case": "DIPROMATS2024-TASK1",
    "id": 8408,
    "tweet_id": 4456...,
    "language": "en",
    "label_task1": false,
    "label_task2": [],
    "label_task3": []
  },
  {
    "test_case": "DIPROMATS2024-TASK1",
    "id": 8408,
    "tweet_id": 8787...,
    "language": "en",
    "label_task1": true,
    "label_task2": ["2 discrediting the opponent", "3 loaded language"],
    "label_task3": ["2 discrediting the opponent - name calling", "3 loaded language"]
  },
  ...
]
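For reference, a run file in this format can be written with a few lines of Python. This is only a sketch: the predictions list, the tweet id and the output file name (run1.json) are placeholders.

import json

# Hypothetical predictions: one dict per tweet, mirroring the format above.
predictions = [
    {
        "test_case": "DIPROMATS2024-TASK1",
        "id": 8408,
        "tweet_id": 4456,  # placeholder tweet id
        "language": "en",
        "label_task1": False,
        "label_task2": [],
        "label_task3": [],
    },
]

with open("run1.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)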

 

"label_task1" must be either true or false

"label_task2" may have zero (if label_task1 is false), one or more labels from the list below

"label_task3" may have zero (if label_task1 is false), one or more labels from the list below

 

"label_task1" = true, false    

"label_task2" = ["1 appeal to commonality","2 discrediting the opponent","3 loaded language"]

"label_task3" = ["1 appeal to commonality - ad populum", "1 appeal to commonality - flag waving", "2 discrediting the opponent - doubt", "2 discrediting the opponent - Appeal to Fear","2 discrediting the opponent - name calling", "2 discrediting the opponent - undiplomatic assertiveness/whataboutism", "3 loaded language"]

 

Each participating group may submit up to five runs. Note that each run may contain results for one or both languages, and for one or more subtasks.


Reference:

Amigó, E. and Delgado, A. (2022). Evaluating Extreme Hierarchical Multi-label Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5809–5819, Dublin, Ireland. Association for Computational Linguistics.