VarDial Evaluation Campaign
The Third VarDial Evaluation Campaign includes the five shared tasks listed below. It follows the First VarDial Evaluation Campaign at VarDial 2017, which featured four shared tasks, and the Second VarDial Evaluation Campaign at VarDial 2018, which featured five shared tasks. Before the evaluation campaigns, past editions of VarDial featured the DSL shared tasks (DSL 2014, DSL 2015, and DSL 2016), which focused on the identification of similar languages and language varieties; DSL 2016 also included Arabic dialects.
To participate, please fill in this registration form.
German Dialect Identification (GDI)
- Task Organizers: Yves Scherrer (University of Helsinki, Finland) and Tanja Samardžić (University of Zurich, Switzerland)
- Contact: yves.scherrer(at)gmail.com
- Task Description: After two successful editions of the (Swiss) German Dialect Identification task, we propose to organize a third iteration of this task. We will again focus on four Swiss German dialect areas (Basel, Bern, Lucerne, Zurich). We provide updated speech transcripts for all dialect areas and also release corresponding acoustic data in the form of iVectors as well as (predicted) word-level normalisation. In particular, the acoustic data may help to overcome transcriber bias; the recent iterations of the ADI task have already shown that acoustic features substantially improve dialect identification.
- Training Type: Closed
- Acknowledgments: The GDI organizers thank Thayabaran Kathiresan and Lei He (University of Zurich) for their help with the preparation of the iVectors.
Cross-lingual Morphological Analysis (CMA)
- Task Organizers: Francis Tyers (Indiana University, United States) and Miikka Silfverberg (University of Helsinki, Finland)
- Contact: ftyers(at)prompsit.com
- Task Description: We introduce the task of cross-lingual morphological analysis. Given a word in an unknown related language, for example "navifraghju" ("shipwreck" in Corsican), a human speaker of several related languages can infer that it is a singular noun by comparing it with similar words, for example "naufragi" (Catalan), "naufragio" (Spanish, Italian), "naufrágio" (Portuguese) and "naufrage" (French). In this task we invite participants to create computational models that can do the same. Two language families are represented: Romance (fusional morphology) and Turkic (agglutinative morphology). In the "Closed" track, participants will be given a set of word forms with all valid morphological analyses in six languages and asked to predict the valid morphological analyses for a seventh, unseen language. In the "Semi-Closed" track, the process is the same, except that the organisers will provide participants with additional raw data: raw-text Wikipedia dumps, bilingual dictionaries from the Apertium project, and any treebanks available for the known languages from the Universal Dependencies project.
- Training Types: Closed and Semi-Closed
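As a rough illustration of the analogy described in the CMA task description (not the organizers' method or a reference baseline), the following Python sketch assigns an analysis to an unseen word by letting its most similar known word forms vote. The toy lexicon reuses the cognates from the task description plus two Spanish verbs added for contrast. The tag strings such as NOUN;Sing, the function names, and the choice of difflib string similarity are all assumptions made for this example. The sketch also predicts a single analysis, whereas the task asks for all valid ones.

```python
from collections import Counter
from difflib import SequenceMatcher

# Toy lexicon of (word form, language, analysis) triples.
# The tag strings are hypothetical, not the task's annotation scheme.
LEXICON = [
    ("naufragi", "Catalan", "NOUN;Sing"),
    ("naufragio", "Spanish", "NOUN;Sing"),
    ("naufrágio", "Portuguese", "NOUN;Sing"),
    ("naufrage", "French", "NOUN;Sing"),
    ("naufragar", "Spanish", "VERB;Inf"),
    ("cantar", "Spanish", "VERB;Inf"),
]


def similarity(a, b):
    """Character-level similarity in [0, 1], computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


def predict_analysis(word, lexicon, k=3):
    """Return the majority analysis among the k most similar known forms."""
    ranked = sorted(
        lexicon,
        key=lambda entry: similarity(word, entry[0]),
        reverse=True,
    )
    votes = Counter(analysis for _, _, analysis in ranked[:k])
    return votes.most_common(1)[0][0]


# The unseen Corsican word from the task description.
prediction = predict_analysis("navifraghju", LEXICON)
print(prediction)
```

On this toy lexicon the closest forms are mostly the noun cognates, so the vote yields NOUN;Sing. A real system would need to handle multiple analyses per form and would likely learn the similarity rather than rely on raw string overlap.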
Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT)
- Task Organizers: Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang (The Hong Kong Polytechnic University, Hong Kong)
- Contact: natalka.kljueva(at)gmail.com
- Task Description: Like English, Mandarin has several varieties among its speaking communities. This task aims at discriminating between two major varieties of Mandarin Chinese: Putonghua (Mainland China) and Guoyu (Taiwan). We provide a corpus of approximately 10,000 news-domain sentences for each of the two varieties. The main task is to determine whether a sentence comes from a news article from Mainland China or from Taiwan. The sentences are tokenized and punctuation is removed. Both the traditional and the simplified versions of the same corpus are available, and the results will be evaluated in two separate tracks (Simplified and Traditional).
- Training Type: Closed
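A minimal sketch of how the Simplified track of DMT could be approached, assuming whitespace-tokenized input as described in the task description: a naive Bayes classifier over tokens with add-one smoothing. The four training sentences are invented toy examples built around well-known lexical differences (for instance 软件 vs. 软体 for "software" and 出租车 vs. 计程车 for "taxi"); they are not taken from the task corpus, and the label and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented toy data (NOT from the task corpus): tokenized, no punctuation.
TRAIN = [
    ("公司 发布 新 软件", "Mainland"),
    ("乘客 等 出租车", "Mainland"),
    ("公司 发布 新 软体", "Taiwan"),
    ("乘客 等 计程车", "Taiwan"),
]


def train(samples):
    """Count tokens per label and collect the shared vocabulary."""
    token_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        token_counts[label].update(text.split())
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return token_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    token_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        n_tokens = sum(token_counts[label].values())
        score = math.log(label_counts[label] / total)
        for tok in text.split():
            # Add-one smoothing so unseen tokens do not zero out the score.
            count = token_counts[label][tok]
            score += math.log((count + 1) / (n_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, "新 软体 上市"))
print(predict(model, "出租车 司机"))
```

The same code would apply to the Traditional track by swapping in traditional-character data. With only a handful of discriminative words the toy model is trivially separable; real news sentences would call for richer features such as character n-grams.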
Moldavian vs. Romanian Cross-dialect Topic identification (MRC)
- Task Organizers: Radu Ionescu and Andrei Butnaru (University of Bucharest, Romania)
- Contact: raducu.ionescu(at)gmail.com
- Task Description: In the Moldavian vs. Romanian Cross-dialect Topic identification shared task, we provide participants with the MOROCO data set, which contains Moldavian and Romanian text samples collected from the news domain. Each sample belongs to one of the following six topics: culture, finance, politics, science, sports, tech. The samples are preprocessed to remove named entities. For each sample, the data set provides the corresponding dialect and topic labels. We propose three sub-tasks for the 2019 VarDial Evaluation Campaign. The first sub-task is binary classification by dialect, in which a model must discriminate between the Moldavian and the Romanian dialects. The second sub-task is Moldavian-to-Romanian cross-dialect multi-class classification by topic, in which a model must classify samples written in the Romanian dialect into the six topics, using samples written in the Moldavian dialect for training. Finally, the third sub-task is Romanian-to-Moldavian cross-dialect multi-class classification by topic, in which a model must classify samples written in the Moldavian dialect into the six topics, using samples written in the Romanian dialect for training.
- Training Type: Closed
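The three MRC sub-tasks differ only in which label is predicted and which dialect supplies the training data. The following sketch shows one way to derive the three views from a single labelled collection. The Sample record, the dialect codes "MD" and "RO", the placeholder texts, and the function names are assumptions for illustration; the actual MOROCO file format and official splits are not described here.

```python
from dataclasses import dataclass

TOPICS = ("culture", "finance", "politics", "science", "sports", "tech")


@dataclass(frozen=True)
class Sample:
    text: str
    dialect: str  # hypothetical codes: "MD" (Moldavian) or "RO" (Romanian)
    topic: str    # one of TOPICS


def dialect_task(samples):
    """Sub-task 1: every sample, labelled by dialect."""
    return [(s.text, s.dialect) for s in samples]


def cross_dialect_topic_task(samples, train_dialect, test_dialect):
    """Sub-tasks 2 and 3: topic labels, train on one dialect, test on the other."""
    train = [(s.text, s.topic) for s in samples if s.dialect == train_dialect]
    test = [(s.text, s.topic) for s in samples if s.dialect == test_dialect]
    return train, test


# Placeholder samples; the texts are dummies, not MOROCO data.
samples = [
    Sample("text a", "MD", "politics"),
    Sample("text b", "MD", "sports"),
    Sample("text c", "RO", "politics"),
    Sample("text d", "RO", "tech"),
]

task1 = dialect_task(samples)
task2_train, task2_test = cross_dialect_topic_task(samples, "MD", "RO")
task3_train, task3_test = cross_dialect_topic_task(samples, "RO", "MD")
```

Keeping the split logic in one function makes it harder to leak test-dialect samples into training by accident, which is the main pitfall of a cross-dialect setup.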
Cuneiform Language Identification (CLI)
- Task Organizer: Tommi Jauhiainen (University of Helsinki, Finland)
- Contact: tommi.jauhiainen(at)helsinki.fi
- Task Description: This task focuses on discriminating between languages and dialects originally written in cuneiform signs. The task includes two languages: Sumerian and Akkadian. Furthermore, the Akkadian language is divided into six dialects: Old Babylonian, Middle Babylonian peripheral, Standard Babylonian, Neo Babylonian, Late Babylonian, and Neo Assyrian. These languages and dialects were used in ancient Mesopotamia and span a time period of 3,000 years. For training and development, we provide the participants with varying amounts of text encoded in Unicode cuneiform signs for each language or dialect. We are interested in seeing whether language identification between dialects that share the same logosyllabic writing system differs from language identification between languages that use segmental scripts.
- Training Type: Closed
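Because the CLI texts are encoded as Unicode cuneiform signs, sign n-grams are a natural starting feature for language identification. The sketch below, offered only as an assumption about how one might begin, counts sign unigrams and bigrams in a line. Python 3 strings are sequences of code points, so each sign in the Cuneiform block (U+12000 to U+123FF) is a single element. The example line is an arbitrary sequence of signs, not real text from the data set, and dropping whitespace is a choice made for this example.

```python
from collections import Counter

CUNEIFORM_START, CUNEIFORM_END = 0x12000, 0x123FF


def is_cuneiform(ch):
    """True if ch lies in the main Unicode Cuneiform block."""
    return CUNEIFORM_START <= ord(ch) <= CUNEIFORM_END


def sign_ngrams(line, n_max=2):
    """Count sign n-grams of length 1..n_max, ignoring whitespace."""
    signs = [ch for ch in line if not ch.isspace()]
    grams = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(signs) - n + 1):
            grams["".join(signs[i:i + n])] += 1
    return grams


# Arbitrary signs written as escapes; NOT a real cuneiform text.
line = "\U00012000\U0001202D \U00012000"
grams = sign_ngrams(line)
print(len(line), sum(grams.values()))
```

The resulting counts could feed any standard classifier, such as the naive Bayes or nearest-profile methods commonly used for language identification.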
Dates
- Training set release: February 5, 2019
- Test set release: March 5, 2019
- Evaluation Phase: March 4 to 13, 2019
  - March 4 (00:01 GMT) to March 5 (23:59 GMT) - Cuneiform Language Identification (CLI)
  - March 5 (00:01 GMT) to March 8 (23:59 GMT) - Cross-lingual Morphological Analysis (CMA)
  - March 6 (00:01 GMT) to March 7 (23:59 GMT) - Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT)
  - March 8 (00:01 GMT) to March 11 (23:59 GMT) - Moldavian vs. Romanian Cross-dialect Topic identification (MRC)
  - March 12 (00:01 GMT) to March 13 (23:59 GMT) - German Dialect Identification (GDI)
- System papers deadline: March 25, 2019
- Review feedback: April 5, 2019
- Camera-ready versions: April 10, 2019
The evaluation campaign general organizers are Marcos Zampieri (University of Wolverhampton, UK) and Shervin Malmasi (Amazon, USA).
For task-specific questions, please contact the respective task organizer(s). For general questions about the VarDial evaluation campaign, please contact Marcos Zampieri: m.zampieri(at)wlv.ac.uk