DIALECT-COPA

The DIALECT-COPA shared task on causal commonsense reasoning in South-Slavic dialects

Background

Large language models (LLMs) like GPT and Llama have been shown to possess uncanny language understanding and generation abilities, even surpassing human performance across many usage scenarios. Importantly, they also show capabilities in different languages, reaching beyond the high-resource “usual suspects” such as English, Chinese or Spanish. Performance of the most powerful models even for moderately resourced standard language varieties (e.g., Croatian) remains strong and convincing, and they also seem to work reasonably well for large(r) dialects of major languages (e.g., Bavarian or Sicilian). However, LLMs are bound to exhibit substantial performance drops for small(er) dialects of non-major and low-resource languages [1], for which they have seen little to no text data in pretraining. Dialects, unlike standard languages, often span surprisingly limited geographical areas, with distinct micro-dialects spanning areas as small as single-digit square kilometres. For such dialects, obtaining large quantities of raw text is difficult, which renders good out-of-the-box performance of LLMs for those dialects unlikely and warrants investigation of various cross-lingual transfer and sample-efficient model adaptation strategies.

Task

In this shared task, we invite the community to propose, develop, and test approaches for language understanding in micro-dialects of moderately-resourced South-Slavic languages. Concretely, because it offers a good balance between structural simplicity and semantic complexity, we focus on extending the "Choice of Plausible Alternatives" (COPA) task [2,3]. In this task, a classifier has to select which of two candidate statements is more likely to be the cause or effect of a given premise statement. The task is now extended from four relevant standard languages (Slovenian [4], Croatian [5], Serbian [6], and Macedonian [7]) to three micro-dialects: (1) the Cerkno dialect of Slovenian, spoken in the Slovenian Littoral region, specifically from the town of Idrija, (2) the Chakavian dialect of Croatian from the northern Adriatic, specifically from the town of Žminj, and (3) the Torlak dialect from southeastern Serbia, northeastern North Macedonia, and northwestern Bulgaria, specifically from the town of Lebane.

Data

Please fill out this registration form to participate!

At https://github.com/clarinsi/dialect-copa/ we offer training and development data in the four standard languages, as well as training and development data in the Cerkno and Torlak dialects, while the Chakavian dialect is treated as a surprise dialect. In this shared task, any data can be used except for the test data of any COPA-related dataset.

At test time, test data will be released for all three dialects: Cerkno, Chakavian, and Torlak. Although submissions will be described by, among other things, their accuracy on the test data, the focus of this task is not a final leaderboard but the insights gained into how specific decisions affect a system's performance. We therefore encourage participants to send in as many system outputs as they consider informative.

A data instance in the original COPA dataset consists of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise. The random baseline therefore lies at 50% accuracy. An example instance is the following:

Premise: The man turned on the faucet. What is the effect?
Alternative 1: The toilet filled with water.
Alternative 2: Water flowed from the spout.
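As a rough illustration of the instance structure and the accuracy metric described above, here is a minimal Python sketch. The class name `CopaInstance`, its field names, and the 0/1 label encoding are assumptions made for this example and may not match the released data files; please check the repository for the actual format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CopaInstance:
    # Field names are illustrative assumptions, not the official schema.
    premise: str
    choice1: str
    choice2: str
    question: str  # "cause" or "effect"
    label: int     # assumed encoding: 0 -> choice1, 1 -> choice2


def accuracy(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """Return the fraction of instances where the prediction equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# The English example from the text, with the second alternative as the answer.
example = CopaInstance(
    premise="The man turned on the faucet.",
    choice1="The toilet filled with water.",
    choice2="Water flowed from the spout.",
    question="effect",
    label=1,
)
```

On a set whose gold labels are evenly split between the two alternatives, a system that always picks the same alternative scores 0.5 under this metric, which is the same level as the 50% random baseline.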

The dataset was translated from English into each standard language, while the translations into the dialects were performed by translators who had access to both the original English version and the existing translation into the standard variety.

To give a first impression of the linguistic diversity of the data, we give examples for each of the languages and dialects below (including the surprise Chakavian dialect).

English: The man turned on the faucet. Water flowed from the spout.
Slovenian: Moški je odprl pipo. Iz ustja pipe je pritekla voda.
Cerkno dialect: Dic je adparu pipa. Iz pipe je partjekla uoda.
Croatian: Muškarac je otvorio slavinu. Voda je potekla iz mlaznice.
Chakavian dialect: Muški je otpra špino. Oda je počela teć z mlaznici.
Serbian: Човек је отворио славину. Вода је текла из славине.
Serbian (transliterated): Čovek je otvorio slavinu. Voda je tekla iz slavine.
Torlak dialect: Човек одврнуја славину. Вода истичала од славину.
Torlak dialect (transliterated): Čovek odvrnuja slavinu. Voda ističala od slavinu.
Macedonian: Човекот ја отвори славината. Истече вода од славината.
Macedonian (transliterated): Čovekot ja otvori slavinata. Isteče voda od slavinata.

English: The girl found a bug in her cereal. She lost her appetite.
Slovenian: Dekle je v kosmičih našlo žuželko. Izgubila je apetit.
Cerkno dialect: Zjala je najdla hruošče u kosmičih. Zgubila je apetit.
Croatian: Djevojka je pronašla kukca u žitaricama. Izgubila je apetit.
Chakavian dialect: Mlada je našla neko blago va žitaricah. Je zgubila tiek.
Serbian: Девојчица је пронашла бубу у житарицама. Изгубила је апетит.
Serbian (transliterated): Devojčica je pronašla bubu u žitaricama. Izgubila je apetit.
Torlak dialect: Девојчица нашла бубаљку међу њојне житарице. Изгубила си апетит.
Torlak dialect (transliterated): Devojčica našla bubaljku među njojne žitarice. Izgubila si apetit.
Macedonian: Девојката пронајде бубачка во нејзините житарки. Изгуби апетит.
Macedonian (transliterated): Devojkata pronajde bubačka vo nejzinite žitarki. Izgubi apetit.

Suggested approaches

We invite participants to investigate and compare various approaches, such as (but not limited to):

Monolingual fine-tuning in the target language itself
Cross-lingual transfer, with fine-tuning on one or more related languages and/or dialects (i.e., single-source or multi-source cross-lingual transfer)
Signal boosting via the exploitation of both the Latin and the Cyrillic script (in the Eastern part of the dialectal continuum the Cyrillic script is also used)
Collecting unlabeled dialectal data and adapting LLMs for a specific dialect
Dialectal data synthesis (by exploiting regularities in their differences to respective standard languages)
Exploitation of dialectal dictionaries
Machine translation between standard language and dialects (in absence of large-scale parallel corpora for these dialects, the multi-parallel training and development portions of the task-specific COPA dataset may be used for low-resource MT), for data synthesis or test-translate approaches
Anything else that seems relevant and even adventurous (be creative!)
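For the script-related approach in the list above, one simple building block is transliterating Cyrillic text into the Latin script, so that each Cyrillic instance also exists in a Latin form. The sketch below is only an illustration, not part of the task infrastructure: the names `CYR_TO_LAT` and `transliterate` are hypothetical, and the mapping covers the Serbian Cyrillic alphabet only, so Macedonian-specific letters pass through unchanged.

```python
# Lowercase Serbian Cyrillic letters and their Latin counterparts.
_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Add uppercase letters; digraphs are title-cased (e.g. "Љ" -> "Lj").
CYR_TO_LAT = dict(_LOWER)
CYR_TO_LAT.update({cyr.upper(): lat.capitalize() for cyr, lat in _LOWER.items()})


def transliterate(text: str) -> str:
    """Map each Cyrillic character to Latin; leave all other characters as they are."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


# Serbian and Torlak examples taken from the data samples on this page.
print(transliterate("Човек је отворио славину."))
print(transliterate("Девојчица нашла бубаљку међу њојне житарице."))
```

A character-level mapping is enough in this direction because every Cyrillic letter has exactly one Latin counterpart. The reverse direction is harder, since Latin digraphs such as "nj" can correspond to either one Cyrillic letter or two.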

Important dates

Shared task announcement: February 1
Test data release: March 4
Test results submission: March 11
System description paper submission: March 24
Notification of acceptance: April 14
Camera-ready: April 24
Workshop date period: June 16–21

References

[1] Kantharuban, A., Vulić, I., & Korhonen, A. (2023, November). Quantifying the Dialect Gap and its Correlates across Languages. In Findings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 7226-7245).

[2] Roemmele, M., Bejan, C. A., & Gordon, A. S. (2011, March). Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

[3] Ponti, E. M., Glavaš, G., Majewska, O., Liu, Q., Vulić, I., & Korhonen, A. (2020, November). XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 2362-2376).

[4] Žagar, Aleš; Robnik-Šikonja, Marko; Goli, Teja and Arhar Holdt, Špela, 2020, Slovene translation of SuperGLUE, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1380.

[5] Ljubešić, Nikola, 2021, Choice of plausible alternatives dataset in Croatian COPA-HR, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1404.

[6] Ljubešić, Nikola; Starović, Mirjana; Kuzman, Taja and Samardžić, Tanja, 2022, Choice of plausible alternatives dataset in Serbian COPA-SR, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1708.

[7] Ljubešić, Nikola; Koloski, Boshko; Zdravkovska, Kristina and Kuzman, Taja, 2022, Choice of plausible alternatives dataset in Macedonian COPA-MK, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1687.
