DSL-ML - Multi-label classification of similar languages
Background
Discriminating between similar languages (e.g., Croatian and Serbian) and language varieties (e.g., Brazilian and European Portuguese) has been a popular topic at VarDial since its first edition. Most shared tasks on this topic were based on datasets compiled under the assumption that each instance's gold label is determined by where the text was retrieved from. While this is a straightforward (and mostly accurate) practical assumption, previous research has shown the limitations of this problem formulation: some texts present no linguistic marker that would allow systems or native speakers to discriminate between two very similar languages or language varieties.
Various recent efforts proposed to reformulate this problem, either by introducing an additional label for ambiguous instances, or by allowing multiple labels for ambiguous instances:
The DSL-TL task organized at VarDial 2023 [1] [2] focuses on binary classification (e.g., Brazilian vs European Portuguese) and contains a third label (“both/neither”) that is used for instances that cannot be assigned to a single variety. However, this approach cannot be easily extended to settings with more than two varieties.
Bernier-Colborne et al. (2023) [3] analyze the FreCDo dataset [4] used in the FDI 2022 shared task [5] and propose an automatic multi-label conversion based on near-duplicate analysis.
In a similar vein, Keleg & Magdy (2023) [6] propose that Arabic dialect identification tasks should be framed as multi-label classification tasks.
Task
The 2024 DSL-ML task assembles datasets for five different macro-languages, with different types of multi-label annotations, as summarized in the table below:
Participants are expected to provide multi-label annotations for the test set instances. Participating systems will be evaluated on macro-averaged F1 for each test set, and the scores will be aggregated over the five test sets.
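As an illustration of the evaluation described above, the following is a minimal sketch of macro-averaged F1 for multi-label predictions, not the official scorer. It assumes that each instance carries a set of labels, that F1 is computed per label and then averaged with equal weight, and that the aggregate is an unweighted mean over test sets; the task description does not specify these details. The function names and the labels `PT-BR` and `PT-PT` are hypothetical.

```python
from typing import Dict, List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Per-label F1, averaged with equal weight over all observed labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of instances")
    # Every label seen in either the gold or the predicted annotations.
    labels = sorted(set().union(*gold, *pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def aggregate(per_test_set: Dict[str, float]) -> float:
    """Assumed aggregation: unweighted mean of the per-test-set scores."""
    return sum(per_test_set.values()) / len(per_test_set)


# Hypothetical toy data: the third instance is valid in both varieties.
gold = [{"PT-BR"}, {"PT-PT"}, {"PT-BR", "PT-PT"}]
pred = [{"PT-BR"}, {"PT-BR"}, {"PT-BR", "PT-PT"}]
print(round(macro_f1(gold, pred), 4))  # 0.7333
```

In this toy example the per-label scores are 0.8 for `PT-BR` and about 0.667 for `PT-PT`, so the macro average rewards getting the rarer label right as much as the frequent one.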
Closed track: Systems may only use the labeled training data provided for the task. The use of pre-trained models is allowed as long as they are not specifically pre-trained or fine-tuned on language identification tasks.
Open track: Systems may use any data and pre-trained models, except the prohibited datasets listed in the language description.
Data
The training and development data is available at https://github.com/yvesscherrer/DSL-ML-2024. The test sets will be shared privately with registered participants. Please fill out this registration form to participate!
References
[1] Noëmi Aepli, Çağrı Çöltekin, Rob Van Der Goot, Tommi Jauhiainen, Mourhaf Kazzaz, Nikola Ljubešić, Kai North, Barbara Plank, Yves Scherrer, and Marcos Zampieri. 2023. Findings of the VarDial Evaluation Campaign 2023. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 251–261, Dubrovnik, Croatia. Association for Computational Linguistics.
[2] Marcos Zampieri, Kai North, Tommi Jauhiainen, Mariano Felice, Neha Kumari, Nishant Nair, and Yash Bangera. 2023. Language Variety Identification with True Labels. arXiv preprint arXiv:2303.01490.
[3] Gabriel Bernier-Colborne, Cyril Goutte, and Serge Leger. 2023. Dialect and Variant Identification as a Multi-Label Classification Task: A Proposal Based on Near-Duplicate Analysis. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 142–151, Dubrovnik, Croatia. Association for Computational Linguistics.
[4] Mihaela Gaman, Adrian-Gabriel Chifu, William Domingues, and Radu Tudor Ionescu. 2022. FreCDo: A Large Corpus for French Cross-Domain Dialect Identification. arXiv preprint arXiv:2212.07707.
[5] Noëmi Aepli, Antonios Anastasopoulos, Adrian-Gabriel Chifu, William Domingues, Fahim Faisal, Mihaela Gaman, Radu Tudor Ionescu, and Yves Scherrer. 2022. Findings of the VarDial Evaluation Campaign 2022. In Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 1–13, Gyeongju, Republic of Korea. Association for Computational Linguistics.
[6] Amr Keleg and Walid Magdy. 2023. Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification. In Proceedings of ArabicNLP 2023, pages 385–398, Singapore (Hybrid). Association for Computational Linguistics.
[7] Peter Rupnik, Taja Kuzman, and Nikola Ljubešić. 2023. BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 113–120, Dubrovnik, Croatia. Association for Computational Linguistics.