Evaluation 🧑‍⚖️ 

We will use a modified version of the evaluation script from the previous edition of the competition; it is available in this repo, and you are welcome to use it in your own experiments.


All subtasks will be evaluated and ranked using the macro-averaged F1 score.

Measures such as accuracy, binary F1, and other fine-grained metrics computed from true/false positives and negatives will also be reported, but ONLY for analysis purposes in the task overview.
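
As a reference, these metrics can be computed with scikit-learn. The snippet below is only a minimal sketch, not the official evaluation script; the label names and predictions are placeholder values.

```python
# Minimal sketch of the scoring metrics using scikit-learn (NOT the official
# evaluation script). Labels and predictions below are placeholders.
from sklearn.metrics import accuracy_score, f1_score

gold = ["class_a", "class_b", "class_b", "class_a"]   # hypothetical gold labels
pred = ["class_a", "class_a", "class_b", "class_a"]   # hypothetical system output

# Ranking metric: macro-averaged F1 over all classes
macro_f1 = f1_score(gold, pred, average="macro")

# Extra metrics, reported only for analysis in the task overview
accuracy = accuracy_score(gold, pred)
per_class_f1 = f1_score(gold, pred, average=None)

print(f"macro F1 = {macro_f1:.4f}, accuracy = {accuracy:.4f}, per-class F1 = {per_class_f1}")
```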

Statistical significance tests between system runs will also be performed.
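
The exact significance test is not specified here; one common choice for comparing two runs is paired bootstrap resampling over the ranking metric, sketched below under that assumption.

```python
# Hedged sketch of paired bootstrap resampling between two system runs,
# assuming macro F1 as the comparison metric (the organizers' actual test
# may differ).
import numpy as np
from sklearn.metrics import f1_score

def paired_bootstrap(gold, pred_a, pred_b, n_samples=1000, seed=0):
    """Return the fraction of bootstrap samples in which system A beats system B."""
    rng = np.random.default_rng(seed)
    gold, pred_a, pred_b = map(np.asarray, (gold, pred_a, pred_b))
    n = len(gold)
    wins = 0
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=n)  # resample test instances with replacement
        if f1_score(gold[idx], pred_a[idx], average="macro") > \
           f1_score(gold[idx], pred_b[idx], average="macro"):
            wins += 1
    return wins / n_samples

# Example usage: a value close to 1.0 suggests system A's advantage is stable
# across resampled test sets.
# print(paired_bootstrap(gold_labels, run_a_predictions, run_b_predictions))
```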

Baselines

A set of classical machine learning models and pre-trained deep learning models will be used as baselines:

All these baselines will use default hyperparameters, leaving room for participants to explore different hyperparameter configurations of these models or entirely new approaches and models. Both classical machine learning and modern deep learning approaches are expected and welcome.
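
As an illustration only, a classical baseline using nothing but default hyperparameters could look like the sketch below; this is an assumed example with scikit-learn and placeholder data, not one of the official baseline models.

```python
# Illustrative classical baseline with all-default hyperparameters; shown for
# demonstration only, not an official baseline of the task.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Placeholder data; in practice, load the task's training and dev/test splits.
train_texts = ["first example document", "second example document"]
train_labels = ["class_a", "class_b"]
dev_texts = ["third example document"]
dev_labels = ["class_a"]

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())  # defaults only
baseline.fit(train_texts, train_labels)
predictions = baseline.predict(dev_texts)
print("macro F1:", f1_score(dev_labels, predictions, average="macro"))
```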

References