VarDial Evaluation Campaign
The VarDial Evaluation Campaign 2020 is now finished. The results have been e-mailed to the participants. Thank you for your participation!
Every year VarDial hosts an evaluation campaign with multiple shared tasks on different topics related to language variation.
Past editions of the campaign included tasks on similar language and dialect identification, morphosyntactic tagging, dependency parsing, and morphological analysis. For more information, please check the campaign reports from 2019, 2018, and 2017. The VarDial Evaluation Campaign 2020 features three shared tasks. For more information, please register your team using the following link.
RDI - Romanian Dialect Identification
Organizers: Radu Tudor Ionescu and Mihaela Gaman (University of Bucharest)
In the Romanian Dialect Identification (RDI) shared task, we provide participants with the MOROCO data set for training, which contains Moldavian (MD) and Romanian (RO) text samples collected from the news domain. The task is a binary classification by dialect: a classification model is required to discriminate between the Moldavian (MD) and the Romanian (RO) dialects. The task is closed, so participants are not allowed to use external data to train their models. The test set will contain newly collected text samples not previously included in MOROCO. The test samples will come from a different domain, so the methods have to take the cross-domain nature of the task into account.
SMG - Social Media Variety Geolocation
Organizers: Yves Scherrer (University of Helsinki), Dirk Hovy (Bocconi University), Nikola Ljubešić (Jožef Stefan Institute and University of Zagreb), Christoph Purschke (University of Luxembourg)
Most existing VarDial tasks are language identification tasks: they are framed as classification tasks in which each instance is associated with a language variety label. For many language areas, defining a set of discrete labels is not trivial, as there is a continuum between varieties rather than clear-cut borders. Therefore, we introduce a geolocation task this year: given a text, the participants have to predict its geographic location in terms of latitude/longitude coordinates. Geolocation can be framed as a double regression task, but more sophisticated model architectures have been proposed (e.g., Rahimi et al. 2017a, 2017b).
Using data from the social media platforms Twitter and Jodel, we provide three subtasks for three language areas:
Standard German Jodels: Jodel is a mobile chat application that lets people anonymously talk to other users within a 10km-radius around them. This subtask focuses on Jodel conversations initiated in Germany and Austria, which are written in standard German but commonly contain regional and dialectal forms (Hovy & Purschke 2018).
Swiss German Jodels: Hovy & Purschke (2018) also collected Jodel conversations from Switzerland, which were found to be held predominantly in Swiss German dialects. This subtask will rely on a considerably smaller dataset, but we expect it to contain more dialect-specific cues than the standard German one.
BCMS Tweets: This subtask focuses on geolocated tweets published in the area of Croatia, Bosnia and Herzegovina, Montenegro and Serbia in the so-called BCMS macro-language (ISO 639-3 code hbs). While the independent status of the individual languages is disputed, there is significant variation between them, which is expected to help with the task at hand.
All three subtasks will use the same data format and evaluation methodology, and participants are encouraged to submit their systems for all subtasks.
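The double-regression framing mentioned above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the organizers' baseline or scoring script: the function names (`fit_centroid`, `predict`, `median_error_km`), the toy coordinates, and the choice of great-circle (haversine) distance as an error measure are all hypothetical. The simplest possible double regression is an intercept-only model that predicts latitude and longitude independently as the training means, ignoring the text; any text-based model would replace `predict`.

```python
import math
from statistics import fmean, median


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


def fit_centroid(train_coords):
    """Intercept-only double regression: one mean per output dimension."""
    lats = [lat for lat, _ in train_coords]
    lons = [lon for _, lon in train_coords]
    return fmean(lats), fmean(lons)


def predict(text, centroid):
    """Hypothetical predictor: ignores the text and returns the centroid."""
    return centroid


def median_error_km(predictions, gold):
    """Median haversine distance between predicted and gold coordinates."""
    return median(
        haversine_km(p_lat, p_lon, g_lat, g_lon)
        for (p_lat, p_lon), (g_lat, g_lon) in zip(predictions, gold)
    )


# Toy (made-up) training coordinates as (latitude, longitude) pairs.
train = [(46.0, 7.0), (48.0, 9.0)]
centroid = fit_centroid(train)
preds = [predict("some text", centroid) for _ in train]
error = median_error_km(preds, train)
```

A text-based system would keep the same interface and replace the constant prediction with two regressors (or one model with two outputs) over text features.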
ULI - Uralic Language Identification
Organizers: Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, and Krister Lindén (University of Helsinki, Finland)
This task focuses on discriminating between the languages in the Uralic group as defined by the ISO 639-3 standard. The task includes 29 individual relevant languages, some of which are extremely closely related and similar, such as Kven Finnish (fkv) and Tornedalen Finnish (fit). These languages are spoken from Scandinavia, Estonia, and Finland all the way to Siberia in Russia. Many of the languages used within Russia are written using modified Cyrillic alphabets. Most of the included languages can be considered under-resourced, for example, Karelian (krl) and Livvi-Karelian (olo), which have fewer than 40,000 native speakers combined. Even more challenging examples are Nganasan, with an estimated 125 speakers and a very limited online presence, and Kemi Sami, which is extinct and only scarcely documented. We acknowledge that the ISO 639-3 classification we have used may not be without problems, but for the purposes of this shared task it identifies these 29 language varieties adequately. Three tracks are available in this shared task. More information and data are available here.
Training set release: April 30, 2020.
Test set release:
July 20, 2020 (originally July 10, 2020)
July 30, 2020 (originally July 17, 2020)
Paper submission deadline:
September 7, 2020 (originally August 15, 2020)
Notification of acceptance:
October 5, 2020 (originally September 14, 2020)
Camera-ready papers due:
October 15, 2020 (originally October 5, 2020)
Marcos Zampieri - Rochester Institute of Technology
Yves Scherrer - University of Helsinki