Evaluation

For each model, we will evaluate the outputs as classification tasks (binary for Task 1, multi-label for Tasks 2 and 3). Note that the second task is non-trivial from the point of view of evaluation metrics, as it is not a straightforward classification problem: the classes have some degree of hierarchical relation between them. For instance, a mistake between group 2 and group 3 is less severe than a mistake between group 2 and group 0. In addition to standard classification metrics (Precision, Recall, F1), we will also report ICM (Amigó and Delgado, 2022), a metric better suited to our task. ICM will be used as the official metric to rank systems.


Your system output should use the following format (which is a simplification of the training data format):

[
  {
    "test_case": "DIPROMATS2023",
    "id": 8408,
    "tweet_id": 4456456456...,
    "language": "en",
    "label_task1": false,
    "label_task2": [],
    "label_task3": []
  },
  {
    "test_case": "DIPROMATS2023",
    "id": 8409,
    "tweet_id": 87878...,
    "language": "en",
    "label_task1": true,
    "label_task2": ["2 discrediting the opponent", "3 loaded language"],
    "label_task3": ["2 discrediting the opponent - name calling", "3 loaded language"]
  },
  ...
]
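A run file in this format can be produced with a short script. The following is a minimal sketch, assuming the predictions are already available as Python dictionaries; the output file name `run1.json` and the concrete field values are illustrative assumptions, not part of the official specification:

```python
import json

# Illustrative predictions following the required submission format.
# The ids, tweet_ids and labels below are example values only.
predictions = [
    {
        "test_case": "DIPROMATS2023",
        "id": 8408,
        "tweet_id": 4456456456,
        "language": "en",
        "label_task1": False,
        "label_task2": [],
        "label_task3": [],
    },
    {
        "test_case": "DIPROMATS2023",
        "id": 8409,
        "tweet_id": 87878,
        "language": "en",
        "label_task1": True,
        "label_task2": ["2 discrediting the opponent", "3 loaded language"],
        "label_task3": ["2 discrediting the opponent - name calling", "3 loaded language"],
    },
]

# Write the run file; json.dump serialises Python True/False as JSON true/false.
with open("run1.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)
```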

 

"label_task1" must be either true or false

"label_task2" may have zero (if label_task1 is false), one or more labels from the list below

"label_task3" may have zero (if label_task1 is false), one or more labels from the list below

 

"label_task1" = true, false    

"label_task2" = ["1 appeal to commonality","2 discrediting the opponent","3 loaded language"]

"label_task3" = ["1 appeal to commonality - ad populum", "1 appeal to commonality - flag waving",

      "2 discrediting the opponent - doubt", "2 discrediting the opponent - Appeal to Fear",

      "2 discrediting the opponent - name calling", "2 discrediting the opponent - undiplomatic assertiveness/whataboutism",

      "3 loaded language"]

 

Each participant group may submit up to five runs. Note that each run may contain results for one or two languages, and for one or more tasks. 

Reference:

Amigó, E. and Delgado, A. (2022). Evaluating Extreme Hierarchical Multi-label Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5809–5819, Dublin, Ireland. Association for Computational Linguistics.