Organizers: Yves Scherrer, Petter Mæhlum, Marthe Løken Midtgaard (University of Oslo), Rob van der Goot (IT University of Copenhagen)
Contact: yves.scherrer@ifi.uio.no
The NoMusic corpus (Mæhlum & Scherrer 2024) provides examples of prompts to digital assistants in ten Norwegian dialects as well as Standard Bokmål. Each prompt is annotated with:
A dialect label
An intent label
A list of slot spans.
The following example (English: How warm will it be today?) illustrates this:
# text = Kor varmt skal det ver i dag?
# intent = weather/find
# dialect = V
1 Kor weather/find O
2 varmt weather/find B-weather/attribute
3 skal weather/find O
4 det weather/find O
5 ver weather/find O
6 i weather/find B-datetime
7 dag weather/find I-datetime
8 ? weather/find O
The intent label is weather/find, the dialect label is V (West), and there are two slots: varmt with label weather/attribute, and i dag with label datetime.
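The annotation format above can be read with a few lines of code. The following is a minimal sketch, not an official reader: the function names `parse_prompt` and `bio_to_spans` are hypothetical, and it assumes the four columns (index, token, intent, slot tag) are separated by whitespace or tabs and that tokens contain no spaces.

```python
EXAMPLE = """\
# text = Kor varmt skal det ver i dag?
# intent = weather/find
# dialect = V
1 Kor weather/find O
2 varmt weather/find B-weather/attribute
3 skal weather/find O
4 det weather/find O
5 ver weather/find O
6 i weather/find B-datetime
7 dag weather/find I-datetime
8 ? weather/find O
"""


def parse_prompt(block):
    """Return (metadata dict, tokens, BIO tags) for one annotated prompt."""
    meta, tokens, tags = {}, [], []
    for line in block.strip().splitlines():
        line = line.strip()
        if line.startswith("#"):
            # Header lines have the form "# key = value".
            key, _, value = line[1:].partition("=")
            meta[key.strip()] = value.strip()
        elif line:
            # Assumed column order: index, token, intent, slot tag.
            _, token, _, tag = line.split()
            tokens.append(token)
            tags.append(tag)
    return meta, tokens, tags


def bio_to_spans(tags):
    """Convert BIO tags to (start, end, label) spans with exclusive end."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # A stray I- tag without a matching open span starts a new span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


meta, tokens, tags = parse_prompt(EXAMPLE)
slots = [(" ".join(tokens[s:e]), lab) for s, e, lab in bio_to_spans(tags)]
```

For the example prompt, `meta` holds the intent `weather/find` and the dialect `V`, and `slots` contains the two slots described above, `varmt` and `i dag`.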
This shared task is composed of three subtasks:
dialect identification (one or several labels per prompt),
intent identification (one label per prompt),
slot detection (one BIO-tag per token).
Participants may submit results for some or all subtasks.
The xSID-0.6 corpus (https://github.com/mainlp/xsid/tree/main/data/xSID-0.6) provides training data in multiple languages annotated with slots and intents. The original training data is in English (from Snips and Facebook) and has been automatically translated, with the slot and intent labels projected, into a number of languages, including the closely related language Danish. Participants are allowed to use any other relevant resources, as long as they are not annotated for slots and intents in Norwegian. Specifically, the following resources are allowed:
Annotated training and development data (but not test data!) from all languages in the xSID-0.6 corpus
The automatically translated Norwegian Bokmål training set
The shared task development set
Raw text data from standard and non-standard Norwegian varieties (e.g. Wikipedia dumps, web crawls, social media data, cf. also below)
Pretrained language models supporting Norwegian
For dialect identification, we do not provide specific training data, but participants may use the development set for training. We encourage participants to retrieve additional data for this subtask. Dialectal Norwegian text may be found, for example, in the following resources:
The Nordic Dialect Corpus (dialectological transcriptions): https://tekstlab.uio.no/scandiasyn/download.html
The LIA Corpus (dialectological transcriptions): https://tekstlab.uio.no/LIA/korpus.html
Nordic Tweet Stream (geolocalized tweets): https://nordictweetstream.fi/
NorDial: https://github.com/jerbarnes/nordial/tree/main/tweet_level/data
We provide a development set with 3300 utterances (see example above). This is a concatenated and shuffled version of the 11 Norwegian translations of the xSID development set. Participants are allowed to use the development set (or parts thereof) for training.
During the test period, participants will receive a test set with 5500 utterances, corresponding to the concatenated and shuffled version of the 11 Norwegian translations of the xSID test set.
The submitted systems will be evaluated with span F1-score for the slots, accuracy for the intents, and weighted F1-score for the dialects. We will also provide dialect-specific slot and intent evaluation scores to assess the systems' robustness to dialectal variation.
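To make the three metrics concrete, here is a minimal sketch of how they could be computed. It is not the official scorer: all function names are hypothetical, span F1 is assumed to be micro-averaged over exact-match spans, and the weighted F1 shown here assumes a single dialect label per prompt, whereas the dialect subtask allows several labels, which the official evaluation may handle differently.

```python
from collections import Counter


def bio_to_spans(tags):
    """Convert BIO tags to (start, end, label) spans with exclusive end."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def span_f1(gold, pred):
    """Micro-averaged F1 over exactly matching slot spans (assumed definition)."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = set(bio_to_spans(gold_tags)), set(bio_to_spans(pred_tags))
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def accuracy(gold, pred):
    """Fraction of prompts whose predicted intent equals the gold intent."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def weighted_f1(gold, pred):
    """Per-label F1 averaged with weights equal to each label's gold support."""
    support = Counter(gold)
    score = 0.0
    for label, n in support.items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = n - tp
        score += n * (2 * tp / (2 * tp + fp + fn))
    return score / len(gold)


# Toy data: the prediction truncates the "i dag" span to "i" only.
gold_slots = [["O", "B-weather/attribute", "O", "O", "O", "B-datetime", "I-datetime", "O"]]
pred_slots = [["O", "B-weather/attribute", "O", "O", "O", "B-datetime", "O", "O"]]
slot_score = span_f1(gold_slots, pred_slots)
```

Because span F1 only rewards exact matches, the truncated `datetime` span in the toy prediction counts as both a missed gold span and a spurious predicted span.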
Please register your interest by filling out this form: https://forms.gle/Ji6jAAEHyZjSE4d58
This will allow the organizers to stay in touch with you and inform you about updates regarding the shared task. Registration is not binding.
Training and development set release: October 7, 2024
xSID 0.6: https://github.com/mainlp/xsid/tree/main/data/xSID-0.6
Norwegian training set and NorSID development set: https://github.com/ltgoslo/NoMusic/tree/main/NorSID
Test set release: November 5, 2024
Submissions due: November 15, 2024
Paper submission deadline: November 25, 2024
Notification of acceptance: December 5, 2024
Camera-ready papers due: December 13, 2024
Paper submission: https://softconf.com/coling2025/VarDial25/