Evaluation 🧑‍⚖️ 

Ranking

03/03/23: We released the evaluation script in this repo. Feel free to use it.


All subtasks will be evaluated and ranked using the macro F1-score.

Measures such as accuracy, binary F1, and other fine-grained metrics computed in terms of false/true negatives/positives will also be considered, but ONLY for analysis purposes in the task overview.
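As a minimal sketch of how these metrics can be computed with scikit-learn (the released evaluation script is the authoritative implementation; the label names below are only illustrative):

```python
# Minimal sketch of the ranking and analysis metrics using scikit-learn.
# The released evaluation script is the authoritative implementation;
# the label names below are only illustrative.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["generated", "human", "human", "generated", "human"]
y_pred = ["generated", "human", "generated", "generated", "human"]

# Macro F1: unweighted mean of the per-class F1 scores (used for ranking).
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Secondary measures, reported only for analysis in the task overview.
accuracy = accuracy_score(y_true, y_pred)
binary_f1 = f1_score(y_true, y_pred, pos_label="generated", average="binary")

print(f"macro F1 = {macro_f1:.3f}, accuracy = {accuracy:.3f}, binary F1 = {binary_f1:.3f}")
```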

Statistical significance testing between system runs will also be performed.
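No specific test is prescribed here; one common choice for paired comparisons of two systems' predictions on the same test set is an approximate-randomization (permutation) test on the macro-F1 difference, sketched below with hypothetical inputs.

```python
# Hedged sketch: approximate-randomization test on the macro-F1 difference
# between two system runs. The task does not prescribe this particular test;
# it is only one common choice for paired comparisons on a shared test set.
import numpy as np
from sklearn.metrics import f1_score


def randomization_test(y_true, pred_a, pred_b, n_iter=10_000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    observed = abs(f1_score(y_true, pred_a, average="macro")
                   - f1_score(y_true, pred_b, average="macro"))
    count = 0
    for _ in range(n_iter):
        # Randomly swap the two systems' predictions on each instance.
        swap = rng.random(len(y_true)) < 0.5
        a = np.where(swap, pred_b, pred_a)
        b = np.where(swap, pred_a, pred_b)
        diff = abs(f1_score(y_true, a, average="macro")
                   - f1_score(y_true, b, average="macro"))
        if diff >= observed:
            count += 1
    # p-value with add-one smoothing
    return (count + 1) / (n_iter + 1)
```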


There will be one leaderboard per subtask and language, so you can participate in any combination of subtasks and languages you want.

Baselines

A set of classical machine learning models and pre-trained deep learning models will be used as baselines:

- Logistic Regression with bags of word- and character-level n-grams (a rough sketch of this baseline is given after this list).
- Low-Dimensionality Statistical Embeddings (LDSE) (Rangel et al., 2018).
- Symanto Brain.
- XLM-RoBERTa (Conneau et al., 2019).
- MDeBERTa (He et al., 2021).
- RoBERTa-Large-BNE (Gutiérrez-Fandiño et al., 2021) and BETO (Cañete et al., 2020), for Spanish.
- RoBERTa (Liu et al., 2019) and DeBERTa (He et al., 2020), for English.
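As a rough illustration of the first baseline, the sketch below combines word- and character-level bag-of-n-grams features with a logistic regression classifier in scikit-learn. The exact n-gram ranges, preprocessing, and variable names are assumptions, not the organizers' configuration.

```python
# Rough sketch of the logistic-regression baseline with word- and
# character-level bag-of-n-grams features (scikit-learn). The exact
# feature ranges and preprocessing are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("word_ngrams", CountVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char_ngrams", CountVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

# train_texts / train_labels would come from the task's training split.
# pipeline.fit(train_texts, train_labels)
# predictions = pipeline.predict(test_texts)
```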


All these baselines will use default hyperparameters, so participants can explore different hyperparameter configurations of these models or try new approaches and models altogether. Both classical and modern machine and deep learning approaches are expected and welcome.
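For the transformer baselines, "default hyperparameters" can be read as the out-of-the-box settings of a standard fine-tuning loop. The sketch below uses the Hugging Face Trainer with default TrainingArguments purely as an illustration; the model checkpoint, dataset column names, and number of labels are assumptions.

```python
# Hedged sketch: fine-tuning one of the transformer baselines (here XLM-RoBERTa)
# with out-of-the-box hyperparameters via Hugging Face Transformers.
# Dataset column names ("text", "label") and num_labels=2 are assumptions.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)


def tokenize(batch):
    # Tokenize the raw text column into fixed-length inputs.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)


# train_dataset / eval_dataset would be datasets.Dataset objects built from the
# task's training split, with the tokenizer applied via .map(tokenize, batched=True).
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="baseline_xlmr"),  # default hyperparameters
    # train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
)
# trainer.train()
```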

References