Evaluation 🧑‍⚖️
Ranking
03/03/23: We released the evaluation script in this repo. Feel free to use it.
All subtasks will be evaluated and ranked using macro F1-score.
Measures such as accuracy, binary F1, and other fine-grained metrics computed from true/false positives and negatives will also be reported, but ONLY for analysis purposes in the task overview.
Statistical significance testing between system runs will also be performed.
There will be one leaderboard per subtask and language, so you can participate in any combination of subtask and language you want.
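For reference, macro F1 averages the per-class F1 scores so that every class counts equally, regardless of how frequent it is. Below is a minimal plain-Python sketch of the metric (equivalent to scikit-learn's `f1_score(..., average="macro")`); the labels and predictions are purely illustrative toy data, not from any task dataset:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        # Per-class counts of true positives, false positives, false negatives
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Toy imbalanced binary example (illustrative labels only)
y_true = ["A", "B", "B", "B", "A"]
y_pred = ["A", "B", "A", "B", "B"]
print(round(macro_f1(y_true, y_pred), 3))  # → 0.583
```

Because the mean is unweighted, a system that ignores a minority class is penalised much more heavily under macro F1 than under accuracy.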
Baselines
A set of classical machine learning models and pre-trained deep learning models will be used as baselines:
Classical machine learning:
o Logistic Regression with bag of n-grams at the word and character levels.
o Low-Dimensionality Statistical Embeddings (LDSE) (Rangel et al., 2018).
Pre-trained large language models:
Spanish
o XLM-RoBERTa (Conneau et al., 2020)
o mDeBERTa (He et al., 2021)
o RoBERTa-Large-BNE (Gutiérrez-Fandiño et al., 2022)
o BETO (Cañete et al., 2020)
English
o XLM-RoBERTa (Conneau et al., 2020)
o mDeBERTa (He et al., 2021)
o RoBERTa (Liu et al., 2019)
o DeBERTa (He et al., 2020)
All these baselines will use default hyperparameters, leaving room for participants to explore alternative hyperparameter configurations of these models or entirely new approaches. Both classical machine learning and modern deep learning approaches are expected and welcome.
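As an illustration of the classical baseline family, the sketch below combines word- and character-level bag-of-n-grams features with a logistic regression classifier using scikit-learn. The n-gram ranges, the TF-IDF weighting, and the toy corpus are all illustrative assumptions, not the organisers' exact baseline configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Word- and character-level n-gram features feeding a logistic regression
# classifier. Ranges (1-2 word n-grams, 2-4 char n-grams) are illustrative.
model = Pipeline([
    ("features", FeatureUnion([
        ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tiny toy corpus just to show the API; real runs would use the task data.
texts = [
    "great clear example",
    "terrible confusing mess",
    "clear and great",
    "confusing and terrible",
]
labels = [1, 0, 1, 0]
model.fit(texts, labels)
print(model.predict(["a great example"]))
```

Character-level n-grams (here with `analyzer="char_wb"`, which respects word boundaries) are a common complement to word n-grams because they are robust to spelling variation and informal text.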
References
Rangel, F., Franco-Salvador, M., & Rosso, P. (2018). A low dimensionality representation for language variety identification. In Computational Linguistics and Intelligent Text Processing: 17th International Conference, CICLing 2016, Konya, Turkey, April 3–9, 2016, Revised Selected Papers, Part II (pp. 156–169). Springer International Publishing.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2020, July). Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8440-8451).
He, P., Gao, J., & Chen, W. (2021). DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543.
Gutiérrez-Fandiño, A., Armengol-Estapé, J., Pàmies, M., Llop-Palao, J., Silveira-Ocampo, J., Pio-Carrino, C., ... & Villegas, M. (2022). MarIA: Spanish language models. Procesamiento del Lenguaje Natural, 68.
Cañete, J., Chaperon, G., Fuentes, R., Ho, J. H., Kang, H., & Pérez, J. (2020). Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020 (pp. 1–10).
He, P., Liu, X., Gao, J., & Chen, W. (2020). DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.