Shared Tasks

The VarDial workshop has a history of hosting well-attended shared tasks on various dialects and languages - see the recent campaign reports from 2022 and 2021.

In VarDial 2023 we are organizing the shared tasks below. Please fill out this registration form to participate. For registrations after 7th of February, please notify the contact person of the respective shared tasks of your enrolment by e-mail in addition to filling out the registration form.

Tasks

SID for low-resource language varieties (SID4LR)

Organizers: Noëmi Aepli (University of Zurich), Rob van der Goot (IT University of Copenhagen), Barbara Plank (LMU Munich), Yves Scherrer (University of Helsinki)

Contact: naepli@cl.uzh.ch

This task is Slot and Intent Detection (SID) for low-resource language varieties. Slot detection is a span labeling task, intent detection a classification task (see the example below, taken from van der Goot et al, 2021). The test set will contain Swiss German (GSW), South Tyrolean (DE-ST), and Neapolitan (NAP). This shared task seeks to answer the following question: How can we best do zero-shot transfer to low-resource language varieties without standard orthography?

The training data consists of the xSID-0.4 corpus, containing data from Snips and Facebook. The original training data is in English, but we also provide automatic translations of the training data into German, Italian and other languages (the projected nmt-transfer data from van der Goot et al., 2021). Participants are allowed to use other data to train on, as long as it is not annotated for SID in the target languages. Specifically, the following resources are allowed:

Annotated data from other (related and unrelated) languages in the xSID-0.4 corpus
Raw text data from the target languages (e.g. Wikipedia dumps, web crawls)
Pretrained language models containing data from the target languages

Participants are not required to submit systems for both tasks, it is also possible to only participate in one of the two tasks, intent detection (classification) or slot detection (span labeling). The systems will be evaluated with the span F1 score for slots and accuracy for intents as the main evaluation metric as is standard for these tasks. Participants may also submit systems for a subset of the three target languages.

Data and baseline available at: bitbucket.org/robvanderg/sid4lr

Test languages: Swiss German (GSW) & Neapolitan (NAP) & South Tyrolean (DE-ST)

Evaluation: Span F1 score for slots (where both span and label must match exactly) and accuracy for intents

Submission type(s): open (use of pretrained models & external data is allowed)

Discriminating Between Similar Languages - True Labels (DSL-TL)

Organizers: Marcos Zampieri (George Mason University), Kai North (George Mason University), Tommi Jauhiainen (University of Helsinki)

Contact: knorth8@gmu.edu

Discriminating between similar languages (e.g., Croatian and Serbian) and language varieties (e.g., Brazilian and European Portuguese) has been a popular topic at VarDial since its first edition. The DSL shared tasks organized in 2014, 2015, 2016, and 2017 have addressed this issue by providing participants with the DSL Corpus Collection (DSLCC), a collection of journalistic texts containing texts written in multiple similar languages and language varieties. The DSLCC was compiled under the assumption that each instance's gold label is determined by where the text is retrieved from. While this is a straightforward (and mostly accurate) practical assumption, previous research has shown the limitations of this problem formulation as some texts may present no linguistic marker that allows systems or native speakers to discriminate between two very similar languages or language varieties.

We tackle this important limitation by introducing the DSL True Labels (DSL-TL) task. DSL-TL will provide participants with a human-annotated DSL dataset. A sub-set of nearly 13,000 sentences were retrieved from the DSLCC and annotated by multiple native speakers of the included language and varieties included, namely English (American and British), Portuguese (Brazilian and European), Spanish (Argentinian and Peninsular). To the best of our knowledge this is the first dataset of its kind opening exciting new avenues for language identification research.

Track 1 - Three-way Classification: In this track, systems will be evaluated with respect to the prediction of all three labels for each language, namely the variety-specific labels (e.g., PT-PT or PT-BR) and the common label (e.g., PT).
Track 2 - Binary Classification: In this track, systems will be scored only on the variety-specific labels (e.g., EN-GB, EN-US).

Please note that system will be scored on the predictions of all 3 languages on the same test set so it has to be able to distinguish between the different languages as well as the varieties.

Evaluation: Macro F1 score over the language/variety labels (9 on track 1, 6 on track 2).

Submission type(s): Closed or Open

Discriminating Between Similar Languages - Speech (DSL-S)

Organizers: Çağrı Çöltekin (University of Tübingen), Mourhaf Kazzaz (University of Tübingen), Tommi Jauhiainen (University of Helsinki), Nikola Ljubešić (Jožef Stefan Institute and University of Zagreb)

Contact: ccoltekin@sfs.uni-tuebingen.de

In the DSL-S 2023 shared task, participants use training and development sets from the Mozilla Common Voice (CV) to develop a language identifier for speech. The nine languages selected for the task come from four different subgroups of Indo-European or Uralic language families. The test data used in this task is the Common Voice test data for the nine languages. The participants are asked not to evaluate their systems themselves nor in any other way investigate the test data before the shared task results have been published. The total amount of unpacked speech data is around 15 gigabytes. Only the .mp3 files from the test set must be used when generating the results. The metadata concerning the test audio files, including their transcriptions, must not be used. This task is audio only.

The 9-way classification task is divided into two separate tracks. Only the training and development data in the Common Voice dataset are allowed in the closed track, and no other data must be used. This prohibition includes systems and models trained (unsupervised or supervised) on any other data. On the open track, the use of any openly available (available to any possible shared task participant) datasets and models not including or trained on the Mozilla Common Voice test set is allowed.

Further instructions available at: github.com/dsl-s/dsl-s.github.io

Test languages: Swedish, Norwegian Nynorsk, Danish, Finnish, Estonian, Moksha, Erzya, Russian, and Ukrainian

Evaluation: Macro F1 score over the nine languages

Submission type(s): Closed, Open

Dates

Training set release: January 23, 2023
Test set release: February 6, 2023
Submissions due: February 17 27, 2023
Paper submission deadline: February 27 March 6, 2023
Notification of acceptance: March 13 16, 2023
Camera-ready papers due: March 27, 2023

Paper Submission

Participants are invited to submit a paper to VarDial 2023 describing their systems.

The paper will be presented at the workshop and it will be included in the VarDial proceedings available at the ACL Anthology.

Your paper should be maximum 9 pages long + references and it should be formatted according to the EACL guidelines. System paper submissions are single-blind. You should include your names and affiliation in the paper.

Please upload your paper by March 6, 2023 (anywhere in the world) on START:
softconf.com/eacl2023/VarDial2023/

Page updated

Google Sites

Report abuse