Shared Tasks

The VarDial workshop has a history of hosting well-attended shared tasks on various dialects and languages; see the recent campaign reports from 2021 and 2020.

In VarDial 2022 we are organizing the shared tasks below. Please fill out this registration form to participate.

Tasks

French Cross-Domain Dialect Identification (FDI)

Organizers: Radu Tudor Ionescu (University of Bucharest), Adrian Gabriel Chifu (Aix-Marseille Université), William Domingues (Aix-Marseille Université), Mihaela Găman (University of Bucharest)

Contact: raducu.ionescu@gmail.com

In the 2022 French Dialect Identification (FDI) shared task, participants have to train a model on news samples collected from one set of publication sources and evaluate it on news samples collected from a different set of publication sources. Not only are the sources different, but so are the topics. Participants therefore have to build a model for a cross-domain, 4-way dialect classification task, in which the model is required to discriminate between the French (FH), Swiss (CH), Belgian (BE) and Canadian (CA) dialects across news samples. The corpus is divided into training, validation and test sets, such that the publication sources and topics are distinct across splits. The training set contains 358,787 samples and the development set 18,002 samples. A further set of 36,733 samples is kept for the final evaluation.
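For illustration only, a simple closed-track baseline for this cross-domain setup is a character n-gram model. The sketch below assumes tab-separated files named fdi_train.tsv and fdi_dev.tsv with one "label<TAB>text" sample per line; these names and this format are assumptions and may not match the actual release.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score

    def read_split(path):
        """Read one split; each line is assumed to hold 'label<TAB>text'."""
        labels, texts = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                label, text = line.rstrip("\n").split("\t", 1)
                labels.append(label)
                texts.append(text)
        return labels, texts

    train_y, train_x = read_split("fdi_train.tsv")  # hypothetical file names
    dev_y, dev_x = read_split("fdi_dev.tsv")

    # Character n-grams capture spelling and morphological cues, which tend to
    # transfer across sources and topics better than topical vocabulary does.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), min_df=5)
    classifier = LinearSVC()

    classifier.fit(vectorizer.fit_transform(train_x), train_y)
    predictions = classifier.predict(vectorizer.transform(dev_x))
    print("Macro-F1 on dev:", f1_score(dev_y, predictions, average="macro"))

A model like this, trained only on the released data, would qualify for the closed submission track.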

Submission type(s): Closed, Open


Identification of Languages and Dialects of Italy (ITDI)

Organizers: Noëmi Aepli (University of Zurich) and Yves Scherrer (University of Helsinki)

Contact: naepli@cl.uzh.ch

We provide participants with Wikipedia dumps (“pages-articles-multistream.xml.bz2”, from 01.03.2022) of 11 languages and dialects of Italy for training (Piedmontese, Venetian, Sicilian, Neapolitan, Emilian-Romagnol, Tarantino, Sardinian, Ligurian, Friulian, Ladin, Lombard). The Standard Italian raw Wikipedia dump may also be used as training data, but there will not be any instances of Standard Italian in the development and test sets. Please use the provided script to download (and extract, if you wish) the dumps to make sure you work with the correct kind and date of the dump.
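As a rough sketch of the pre-processing involved, the snippet below streams raw page texts out of one of these dumps using only the Python standard library. The dump file name shown is an assumption following the usual Wikipedia naming scheme, and stripping wikitext markup (templates, links, tables) is deliberately left out.

    import bz2
    import xml.etree.ElementTree as ET

    def iter_page_texts(dump_path):
        """Yield the raw wikitext of every page in a pages-articles dump."""
        with bz2.open(dump_path, "rb") as f:
            for _event, elem in ET.iterparse(f):
                # MediaWiki export XML is namespaced; match on the local tag name.
                if elem.tag.endswith("}text") and elem.text:
                    yield elem.text
                elem.clear()  # free the element's content to keep memory in check

    # Hypothetical file name following the usual Wikipedia dump naming scheme.
    dump = "scnwiki-20220301-pages-articles-multistream.xml.bz2"
    for i, text in enumerate(iter_page_texts(dump)):
        print(text[:200])
        if i >= 2:
            break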

The task is classification, i.e. the model is required to discriminate between different language varieties. As the training data is provided in the form of raw Wikipedia dumps, careful pre-processing of the data is part of the task. The task is closed; participants are therefore not allowed to use external data to train their models. The only exceptions are off-the-shelf pre-trained language models from the HuggingFace model hub or similar, the use of which has to be clearly stated. The test set will contain newly collected text samples from a subset of the language varieties provided for training. Systems will be evaluated at the sentence level.
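Since pre-trained models from the HuggingFace hub are explicitly permitted, one possible starting point is to fine-tune a multilingual encoder for 11-way sentence classification. In the sketch below, the checkpoint choice and the label inventory (Wikipedia language codes) are assumptions, and the fine-tuning loop itself is omitted.

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Assumed label inventory: Wikipedia language codes of the 11 varieties.
    LABELS = ["pms", "vec", "scn", "nap", "eml", "roa-tara",
              "sc", "lij", "fur", "lld", "lmo"]

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base",                      # illustrative checkpoint choice
        num_labels=len(LABELS),
        id2label=dict(enumerate(LABELS)),
        label2id={label: i for i, label in enumerate(LABELS)},
    )

    # The classification head is randomly initialised here; it still needs to be
    # fine-tuned on sentences extracted from the Wikipedia dumps.
    batch = tokenizer(["A sentence in one of the target varieties."],
                      return_tensors="pt")
    print(model(**batch).logits.shape)  # torch.Size([1, 11])

The use of such a pre-trained checkpoint falls under the stated exception and has to be declared in the system description.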

Submission type(s): Closed

Dialectal Extractive Question Answering (DialQA)

Organizers: Fahim Faisal and Antonis Anastasopoulos (George Mason University)

Contact: ffaisal@gmu.edu

The Dialectal Extractive Question Answering Shared Task invites participants to build QA systems that are robust to dialectal variation. The task builds on existing QA benchmarks (TyDi-QA and SD-QA): specifically, it uses portions of the SD-QA dataset, which recorded dialectal variations of TyDi-QA questions. Participants may either (a) use the baseline automatic speech recognition (ASR) outputs for each dialect with the aim of building a robust text-based QA system, (b) use the provided audio recordings of the questions with the aim of building a dialect-robust ASR system, which can then be evaluated with a baseline QA system, or (c) do both of the above. The shared task provides development and test data for 5 varieties of English (Nigeria, USA, South India, Australia, Philippines), 4 varieties of Arabic (Algeria, Egypt, Jordan, Tunisia), and 2 varieties of Kiswahili (Kenya, Tanzania), as well as code for training baseline systems with modified TyDi-QA data.
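As a concrete illustration of option (a), a text-based QA system can be run directly over the dialectal ASR transcripts of the questions. In the sketch below, the model choice, file name and record layout are assumptions; the official baselines and evaluation scripts are in the repository linked below.

    import json
    from transformers import pipeline

    # Illustrative English-only extractive QA model from the HuggingFace hub.
    qa = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

    # Hypothetical file name and record layout; see the repository below for the
    # actual data format.
    with open("dialqa_dev_eng_nga.json", encoding="utf-8") as f:
        examples = json.load(f)

    for ex in examples[:3]:
        # The question text is the ASR transcript of a dialectal recording, so
        # ASR errors propagate directly into the QA model's input.
        pred = qa(question=ex["question_asr"], context=ex["context"])
        print(pred["answer"], "| gold:", ex["answers"])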

Submission type(s): Any training data may be used, except for the TyDi-QA data in the above 3 languages.

Data and code: https://github.com/ffaisal93/DialQA

Dates

  • Training set release: May 20, 2022

  • Test set release: June 30, 2022

  • Submissions due: July 6, 2022

  • Paper submission deadline: July 29, 2022

  • Notification of acceptance: Aug 22, 2022

  • Camera-ready papers due: Sept 5, 2022


Paper Submission

Participants are invited to submit a paper to VarDial 2022 describing their systems.

The paper will be presented at the workshop and included in the VarDial proceedings, which will be available in the ACL Anthology.

Your paper should be at most 9 pages long, plus references, and should be formatted according to the COLING guidelines. System paper submissions are single-blind, so you should include your names and affiliations in the paper.

Please upload your paper by July 29, 2022 (anywhere in the world) on START.