Evaluation Campaign

VarDial Evaluation Campaign

Together with VarDial 2021 we organized an evaluation campaign with multiple shared tasks.

The campaign is now over. Thank you for your participation! Paper submission information is available below.

For information about previous campaigns, please see the campaign reports from 2020, 2019, 2018, and 2017.

Tasks

Dravidian Language Identification (DLI)

Organizers: Bharathi Raja Chakravarthi (National University of Ireland Galway, Ireland), Ruba Priyadharshini (ULTRA Arts and Science College, Madurai, India), and Eswari Rajagopal (National Institute of Technology Tiruchirappalli, India)

Contact: bharathiraja.akr(at)gmail.com

The Dravidian languages are a language family spoken mainly in the south of India. The four major literary Dravidian languages are Tamil (ISO 639-3: tam), Telugu (ISO 639-3: tel), Malayalam (ISO 639-3: mal), and Kannada (ISO 639-3: kan). Tamil, Malayalam, and Kannada are closely related and belong to the South Dravidian subgroup. The DLI shared task provides participants with a collection of 16,672 YouTube comments as a training set. The comments contain code-mixed sentences in English and one of the South Dravidian languages (Tamil, Malayalam, or Kannada). All comments are written in Roman script (a non-native script). The task is to identify the language of each comment.

Submission type: Closed, Open, and Open-full
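
As an illustration of what a system for this kind of task can look like, here is a minimal sketch of a character n-gram naive Bayes classifier. It is not the organizers' baseline and not any participant's system. The class name NgramNaiveBayes, the helper char_ngrams, the n-gram range, and the toy strings in the usage lines are all assumptions made for this example; the toy strings are invented and are not taken from the task data. The sketch uses only the provided training texts, so it would fit a Closed submission.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of a lowercased, space-padded comment."""
    padded = f" {text.lower()} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams, add-one smoothing."""

    def fit(self, comments, labels):
        # Per-label n-gram counts and per-label document counts (for priors).
        self.ngram_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for comment, label in zip(comments, labels):
            self.ngram_counts[label].update(char_ngrams(comment))
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, comment):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Log prior plus smoothed log likelihood of each n-gram.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / total_docs)
            for gram in char_ngrams(comment):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Usage with invented placeholder strings (NOT real task data).
toy_comments = ["abab abab", "xyxy xyxy"]
toy_labels = ["tam", "mal"]
model = NgramNaiveBayes().fit(toy_comments, toy_labels)
prediction = model.predict("abab")
```

Character n-grams are a natural starting point here because Romanized, code-mixed comments have no standard spelling, so word-level features tend to be sparse.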


Romanian Dialect Identification (RDI)

Organizers: Radu Tudor Ionescu and Mihaela Gaman (University of Bucharest, Romania)

Contact: raducu.ionescu(at)gmail.com

In this second iteration of the Romanian Dialect Identification (RDI) shared task, we provide participants with an augmented version of the MOROCO data set for training, which contains Moldavian (MD) and Romanian (RO) samples of text collected from the news domain. A new test set has been collected, which will allow participants to improve on the results they obtained in VarDial 2020. The task is a binary classification by dialect, in which a classification model is required to discriminate between the Moldavian (MD) and the Romanian (RO) dialects. The task is closed; therefore, participants are not allowed to use external data to train their models. The test set will contain newly collected text samples, not previously included in MOROCO. The test samples will come from a different domain, hence the methods have to take the cross-domain nature of the task into account.

Submission type: Closed and Open


Social Media Variety Geolocation (SMG)

Organizers: Yves Scherrer (University of Helsinki, Finland), Nikola Ljubešić (Jožef Stefan Institute, Slovenia and University of Zagreb, Croatia), Christoph Purschke (University of Luxembourg, Luxembourg)

Contact: yves.scherrer(at)helsinki.fi

In this second iteration of the SMG task, we again focus on a geolocation (rather than identification) task: given a text, the participants have to predict its geographic location in terms of latitude/longitude coordinates. Using data from the social media platforms Twitter and Jodel, we provide extended datasets for the same three subtasks as in 2020:

  • Standard German Jodels: Jodel is a mobile chat application that lets people talk anonymously to other users within a 10 km radius around them. This subtask focuses on Jodel conversations initiated in Germany and Austria, which are written in standard German but commonly contain regional and dialectal forms (Hovy & Purschke 2018).

  • Swiss German Jodels: Hovy & Purschke (2018) also collected Jodel conversations from Switzerland, which were found to be held predominantly in Swiss German dialects. This subtask will rely on a considerably smaller dataset, but we expect it to contain more dialect-specific cues than the standard German one.

  • BCMS Tweets: This subtask focuses on geolocated tweets published in Croatia, Bosnia and Herzegovina, Montenegro, and Serbia in the so-called BCMS macro-language (ISO 639-3: hbs). While the independent status of the specific languages is rather disputed, there is significant variation between them, which should be of assistance in the task at hand.

All three subtasks will use the same data format and evaluation methodology, and participants are encouraged to submit their systems for all subtasks.

Submission type: Closed and Open
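
The evaluation methodology is not spelled out on this page. As a hedged illustration only, one common way to score latitude/longitude predictions is the great-circle distance between the gold and the predicted location, summarized over the test set. The sketch below assumes this kind of metric; the function names haversine_km and distance_summary, the Earth radius constant, and the choice of median and mean as summaries are assumptions of this example, not the official scorer.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_summary(gold, predicted):
    """Median and mean distance (km) over paired (lat, lon) tuples."""
    distances = [haversine_km(*g, *p) for g, p in zip(gold, predicted)]
    return {
        "median_km": median(distances),
        "mean_km": sum(distances) / len(distances),
    }
```

A median is less sensitive than a mean to a few predictions that land very far from the gold location, which is why both are reported in this sketch.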


Uralic Language Identification (ULI)

Organizers: Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, and Krister Lindén (University of Helsinki, Finland)

Contact: tommi.jauhiainen(at)helsinki.fi

This task focuses on discriminating between the languages in the Uralic group as defined by the ISO 639-3 standard. This is an open public leaderboard competition continuing from VarDial 2020, in which participants can submit at any point until the final submission date. Visit this page to see the leaderboard and get more information. The task includes 29 relevant individual languages, some of which are extremely closely related and similar, such as Kven Finnish (fkv) and Tornedalen Finnish (fit). These languages are spoken from Scandinavia, Estonia, and Finland all the way to Russian Siberia.

Submission type: Closed


Submission Types

The VarDial 2021 tasks accept three types of submissions:

  • Open-full: participants are allowed to use any kind of resource including additional labeled training data.

  • Open: participants are allowed to use external resources such as unlabeled corpora, lexicons, and pre-trained embeddings (e.g., BERT), but the use of additional labeled data is not allowed.

  • Closed: no additional resources or pre-trained models are allowed.

The submission types allowed for each task are indicated in the respective task description.
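
The three definitions above can be read as a small rule table. The following sketch encodes them so that a team can check which submission type a planned system falls under. The Resource enum, its member names, and the helper functions are hypothetical names introduced for this example; they are not part of any official tooling.

```python
from enum import Enum, auto


class Resource(Enum):
    PROVIDED_TRAINING_DATA = auto()
    UNLABELED_CORPUS = auto()
    LEXICON = auto()
    PRETRAINED_MODEL = auto()      # e.g. pre-trained embeddings
    EXTRA_LABELED_DATA = auto()


# Submission types ordered from most to least restrictive.
ALLOWED = {
    "closed": {Resource.PROVIDED_TRAINING_DATA},
    "open": {
        Resource.PROVIDED_TRAINING_DATA,
        Resource.UNLABELED_CORPUS,
        Resource.LEXICON,
        Resource.PRETRAINED_MODEL,
    },
    "open-full": set(Resource),
}


def is_allowed(submission_type, resource):
    """True if the resource may be used under the given submission type."""
    return resource in ALLOWED[submission_type]


def strictest_type(resources):
    """Most restrictive submission type that covers all used resources."""
    used = set(resources)
    for submission_type, allowed in ALLOWED.items():
        if used <= allowed:
            return submission_type
    raise ValueError("unknown resource in input")
```

For example, a system that fine-tunes a pre-trained model on the provided data only would be classified as Open by strictest_type, and adding any extra labeled data would move it to Open-full.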


Dates

  • Training set release: December 22, 2020

  • Test set release: January 25, 2021

  • Submissions due: February 2, 2021 (extended from January 28, 2021)

  • Paper submission deadline: February 19, 2021 (extended from February 15, 2021)

  • Notification of acceptance: February 26, 2021

  • Camera-ready papers due: March 3, 2021


Paper Submission

Participants are invited to submit a paper to VarDial 2021 describing their systems.

The paper will be presented virtually at the workshop and will be included in the VarDial proceedings, available in the ACL Anthology.

Your paper should be at most 8 pages long, plus references, and it should be formatted according to the EACL 2021 guidelines. System paper submissions are single-blind, so you should include your names and affiliations in the paper.

Please upload your paper on START by February 19, 2021 (anywhere on Earth).


Campaign Organization

Marcos Zampieri - Rochester Institute of Technology, USA

Yves Scherrer - University of Helsinki, Finland