Evaluation 🧑‍⚖️ 

We will use a modified version of the evaluation script from the previous edition of the competition; it is available in this repo, and you are welcome to use it in your own experiments.


All subtasks will be evaluated and ranked using the macro-averaged F1 score.

Measures such as accuracy, binary F1, and other fine-grained metrics computed from true/false positives and negatives will also be reported, but ONLY for analysis purposes in the task overview.
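
As a reference, these metrics can be computed with scikit-learn. The snippet below is only a minimal sketch, not the official evaluation script; the label names and predictions are placeholder values.

```python
# Minimal sketch of the scoring metrics using scikit-learn (NOT the official
# evaluation script). Labels and predictions below are placeholders.
from sklearn.metrics import accuracy_score, f1_score

gold = ["class_a", "class_b", "class_b", "class_a"]   # hypothetical gold labels
pred = ["class_a", "class_a", "class_b", "class_a"]   # hypothetical system output

# Ranking metric: macro-averaged F1 over all classes
macro_f1 = f1_score(gold, pred, average="macro")

# Extra metrics, reported only for analysis in the task overview
accuracy = accuracy_score(gold, pred)
per_class_f1 = f1_score(gold, pred, average=None)

print(f"macro F1 = {macro_f1:.4f}, accuracy = {accuracy:.4f}, per-class F1 = {per_class_f1}")
```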

Statistical significance tests between system runs will also be performed.
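
The exact significance test is not specified here; one common choice for comparing two runs is paired bootstrap resampling over the ranking metric, sketched below under that assumption.

```python
# Hedged sketch of paired bootstrap resampling between two system runs,
# assuming macro F1 as the comparison metric (the organizers' actual test
# may differ).
import numpy as np
from sklearn.metrics import f1_score

def paired_bootstrap(gold, pred_a, pred_b, n_samples=1000, seed=0):
    """Return the fraction of bootstrap samples in which system A beats system B."""
    rng = np.random.default_rng(seed)
    gold, pred_a, pred_b = map(np.asarray, (gold, pred_a, pred_b))
    n = len(gold)
    wins = 0
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=n)  # resample test instances with replacement
        if f1_score(gold[idx], pred_a[idx], average="macro") > \
           f1_score(gold[idx], pred_b[idx], average="macro"):
            wins += 1
    return wins / n_samples

# Example usage: a value close to 1.0 suggests system A's advantage is stable
# across resampled test sets.
# print(paired_bootstrap(gold_labels, run_a_predictions, run_b_predictions))
```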

Baselines

A set of classical machine learning models and pre-trained deep learning models will be used as baselines:

All these baselines will use default hyperparameters, leaving room for participants to explore different hyperparameter configurations of these models or entirely new approaches and models. Both classical machine learning and modern deep learning approaches are expected and welcome.
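
As an illustration only, a classical baseline using nothing but default hyperparameters could look like the sketch below; this is an assumed example with scikit-learn and placeholder data, not one of the official baseline models.

```python
# Illustrative classical baseline with all-default hyperparameters; shown for
# demonstration only, not an official baseline of the task.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Placeholder data; in practice, load the task's training and dev/test splits.
train_texts = ["first example document", "second example document"]
train_labels = ["class_a", "class_b"]
dev_texts = ["third example document"]
dev_labels = ["class_a"]

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())  # defaults only
baseline.fit(train_texts, train_labels)
predictions = baseline.predict(dev_texts)
print("macro F1:", f1_score(dev_labels, predictions, average="macro"))
```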

References