ULI Shared Task

As training data for the relevant languages, we use the Wanca 2016 corpus (http://urn.fi/urn:nbn:fi:lb-2020022901). In total, the corpus contains 646,043 unique sentences, ranging from 19 sentences of Kemi Sami to 214,225 sentences of Northern Sami. The source version of the corpus can be downloaded from http://urn.fi/urn:nbn:fi:lb-2020022902. The test data includes new sentences from the as yet unpublished Wanca 2017 corpus and will be provided to the participants by the task organizers at the beginning of the evaluation period. Not all of the 29 relevant languages in the training set are attested in the test set: the distribution of languages in the test set is close to the actual distribution of new sentences in the forthcoming Wanca 2017 corpus.

In addition to the relevant languages, the test set includes sentences in 149 other languages. The three largest Uralic languages have been included in this category. The download links for the training data for these non-relevant languages are distributed by the task organizers only to participating teams. In total, the training data for this task consists of 63,772,445 sentences in non-relevant languages and 646,043 sentences in relevant languages, totaling 64,418,488 sentences.

The training data for both the relevant and the non-relevant languages must be considered noisy: for example, some sentences will be incorrectly labeled (though not intentionally). The Wanca 2016 corpus includes an HTTP address for each sentence, and the form of these addresses can itself be used in the task. For example, our current pipeline allows only one of two close languages to be found on the same page, and participants can use this kind of information to clean the corpora if they find it helpful.
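The page-level cleaning idea mentioned above could be sketched roughly as follows. This is only an illustration, not the organizers' pipeline: the row layout (sentence, language code, address), the function name, and the choice of which two languages count as "close" are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def drop_minority_of_close_pair(rows, close_pair):
    """Keep only the majority language of a close pair on each page.

    rows: iterable of (sentence, lang, url) tuples (assumed layout).
    close_pair: two language codes treated as "close" (hypothetical choice).
    """
    rows = list(rows)
    a, b = close_pair

    # Count how often each language of the pair occurs per address.
    per_page = defaultdict(Counter)
    for _, lang, url in rows:
        if lang in close_pair:
            per_page[url][lang] += 1

    kept = []
    for sentence, lang, url in rows:
        if lang in close_pair:
            counts = per_page[url]
            if counts[a] and counts[b]:
                # Both languages appear on this page: keep the majority only
                # (ties go to the first language of the pair).
                majority = a if counts[a] >= counts[b] else b
                if lang != majority:
                    continue
        kept.append((sentence, lang, url))
    return kept


rows = [
    ("s1", "fit", "http://example.org/a"),
    ("s2", "fit", "http://example.org/a"),
    ("s3", "fkv", "http://example.org/a"),  # minority on page a, dropped
    ("s4", "fkv", "http://example.org/b"),
    ("s5", "sme", "http://example.org/a"),  # not in the pair, untouched
]
cleaned = drop_minority_of_close_pair(rows, ("fit", "fkv"))
```

Dropping the minority sentences is one option; relabeling them or discarding the whole page would be equally defensible, depending on how much noise is expected.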

The shared task is divided into three tracks. All of the tracks are closed, so no data or models other than the 64,418,488 sentences in the training set can be used for training. All the tracks use the same training data.

Track 1: ULI-RLE (Relevant languages as equals)

The first track of the shared task considers all the relevant languages equal in value, and the aim is to maximize their average F-score. This is important when one is also interested in finding the very rare languages included in the set of relevant languages. The F-score is calculated as a macro-F1 score over the relevant languages in the training set. For example, if you predict a relevant language that does not occur in the test set at all, your precision, and thus your F1-score, for that language goes to zero. The result is the average of the F1-scores of all the 29 relevant languages.
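A minimal sketch of such a macro-F1 computation is given below. It is not the official scorer: the function name and the input layout (parallel lists of gold and predicted language codes) are assumptions, and so is the `undefined` parameter, which sets the score of a language that is neither present in the test set nor predicted, a case the description above does not specify.

```python
from collections import Counter


def macro_f1(gold, pred, languages, undefined=0.0):
    """Average the per-language F1 over `languages`.

    gold, pred: equal-length lists of language codes, one per sentence.
    undefined: score for a language with no TP, FP, or FN (an assumption).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted language gains a false positive
            fn[g] += 1  # gold language gains a false negative

    scores = []
    for lang in languages:
        # F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall whenever both are defined.
        denom = 2 * tp[lang] + fp[lang] + fn[lang]
        scores.append(2 * tp[lang] / denom if denom else undefined)
    return sum(scores) / len(scores)


relevant = ["sme", "smn", "sjk"]
gold = ["sme", "sme", "smn", "fin"]
pred = ["sme", "smn", "smn", "sjk"]
score = macro_f1(gold, pred, relevant)
```

In this toy example `sjk` is predicted once but never occurs in the gold labels, so its F1 is zero and pulls the average down, which is the effect described above.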

Track 2: ULI-RSS (Relevant sentences as equals)

The second track considers every sentence in the test set that is written in, or is predicted to be in, a relevant language as equal. Compared to the first track, this track gives less weight to the very rare languages, as their precision matters less when the resulting F-score is calculated. The resulting F-score is calculated as a micro-F1 over the test-set sentences that are in the relevant languages together with those that you have predicted to be in relevant languages.
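One way to read this definition is as a standard micro-averaged F1 restricted to the relevant label set, sketched below. The function name and input layout are assumptions, and the official scorer may differ in details such as how a confusion between two relevant languages is counted (here it counts as one false positive and one false negative).

```python
def micro_f1_relevant(gold, pred, relevant):
    """Micro-F1 pooled over all relevant languages.

    gold, pred: equal-length lists of language codes, one per sentence.
    relevant: set of relevant language codes.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            if g in relevant:
                tp += 1
        else:
            if p in relevant:
                fp += 1  # wrongly predicted as a relevant language
            if g in relevant:
                fn += 1  # a relevant sentence that was missed
    denom = 2 * tp + fp + fn
    # Returning 0.0 when nothing relevant is present or predicted
    # is an assumption of this sketch.
    return 2 * tp / denom if denom else 0.0


relevant = {"sme", "smn", "sjk"}
gold = ["sme", "sme", "smn", "fin", "fin"]
pred = ["sme", "smn", "smn", "sjk", "fin"]
score = micro_f1_relevant(gold, pred, relevant)
```

Because counts are pooled before the F1 is computed, a single wrong prediction of a rare language costs one false positive rather than zeroing an entire per-language score.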

Track 3: ULI-178 (All 178 languages as equals)

In the first two tracks, there is no difference between the non-relevant languages when the F1-scores are calculated. The third track, however, does not concentrate specifically on the 29 relevant languages; instead, the target is to maximize the average F-score over all the 178 languages present in the training set. This track will be the LI shared task with the largest number of languages to date (ALTW 2010 included 74 languages). The F-score is calculated as a macro-F1 score over all the languages in the training set.
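The difference between the tracks can be illustrated by collapsing the non-relevant labels into a single class, as in the sketch below. The function name and the placeholder label `"other"` are hypothetical and are not part of the task's data format.

```python
def collapse_non_relevant(labels, relevant, other="other"):
    """Map every non-relevant language code to one placeholder label."""
    return [lab if lab in relevant else other for lab in labels]


relevant = {"sme", "smn"}
gold = ["sme", "fin", "deu"]
pred = ["sme", "deu", "deu"]

# Tracks 1 and 2 view: confusing two non-relevant languages is not an error.
gold_12 = collapse_non_relevant(gold, relevant)
pred_12 = collapse_non_relevant(pred, relevant)

# Track 3 view: labels are kept as they are, so fin predicted as deu
# counts against both languages.
errors_track3 = sum(g != p for g, p in zip(gold, pred))
```

Under this view, a system tuned for the first two tracks only needs to separate relevant from non-relevant text, while the third track also rewards telling the 149 other languages apart.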

The training set contains sentences in the 178 languages below.

The 29 relevant languages are:

  • fit Tornedalen Finnish (meänkieli)
  • fkv Kven (kvääni)
  • izh Ingrian (ižoran keel)
  • kca Khanty (ханты ясанг)
  • koi Komi-Permyak (перем коми кыв)
  • kpv Komi-Zyrian (Коми кыв)
  • krl Karelian (karjal)
  • liv Liv (līvõ kēļ)
  • lud Ludian (lüüdin kiel')
  • mdf Moksha (мокшень)
  • mhr Eastern and Meadow Mari (марий йылме)
  • mns Mansi (мāньси лāтыӈ)
  • mrj Western or Hill Mari (Кырык мары)
  • myv Erzya (эрзянь)
  • nio Nganasan (ня”)
  • olo Livvi (Olonets / livvin karjal)
  • sjd Kildin Sami (Кӣллт са̄мь кӣлл)
  • sjk Kemi Sami (samääškiela)
  • sju Ume Sami (uumajanlappi)
  • sma Southern Sami (åarjel-saemien)
  • sme Northern Sami (davvisámi, davvisámegiella)
  • smj Lule Sami (julevsábme)
  • smn Inari Sami (anarâškielâ)
  • sms Skolt Sami (sää´mǩiõll)
  • udm Udmurt (удмурт кыл)
  • vep Veps (vepsän kel')
  • vot Votic (vad̕d̕a ceeli)
  • vro Võro (võro kiil)
  • yrk Nenets (ненэцяʼ вада)

The 149 non-relevant languages are:

Afrikaans (afr), Tosk Albanian (als), Amharic (amh), Arabic (ara), Assamese (asm), North Azerbaijani (azj), Bashkir (bak), Bavarian (bar), Central Bikol (bcl), Belarusian (bel), Bengali (ben), Bosnian (bos), Bishnupriya (bpy), Breton (bre), Bulgarian (bul), Catalan (cat), Cebuano (ceb), Czech (ces), Chechen (che), Chuvash (chv), Mandarin Chinese (cmn), Corsican (cos), Welsh (cym), Danish (dan), German (deu), Dimli (diq), Dhivehi (div), Standard Estonian (ekk), Modern Greek (ell), English (eng), Esperanto (epo), Basque (eus), Extremaduran (ext), Faroese (fao), Finnish (fin), French (fra), Western Frisian (fry), Irish (gle), Galician (glg), Manx (glv), Goan Konkani (gom), Guarani (grn), Swiss German (gsw), Gujarati (guj), Haitian (hat), Hebrew (heb), Fiji Hindi (hif), Hindi (hin), Croatian (hrv), Upper Sorbian (hsb), Hungarian (hun), Ido (ido), Iloko (ilo), Interlingua (ina), Indonesian (ind), Icelandic (isl), Italian (ita), Javanese (jav), Japanese (jpn), Kalaallisut (kal), Kannada (kan), Georgian (kat), Kazakh (kaz), Kirghiz (kir), Korean (kor), Karachay-Balkar (krc), Kölsch (ksh), Latin (lat), Latvian (lav), Limburgan (lim), Lithuanian (lit), Lombard (lmo), Luxembourgish (ltz), Ganda (lug), Lushai (lus), Malayalam (mal), Marathi (mar), Minangkabau (min), Macedonian (mkd), Malagasy (mlg), Maltese (mlt), Mongolian (mon), Maori (mri), Mirandese (mwl), Mazanderani (mzn), Low German (nds), Nepali (nep), Newari (new), Dutch (nld), Norwegian Nynorsk (nno), Norwegian Bokmål (nob), Pedi (nso), Occitan (oci), Oriya (ori), Ossetian (oss), Pampanga (pam), Panjabi (pan), Iranian Persian (pes), Pfaelzisch (pfl), Piemontese (pms), Western Panjabi (pnb), Polish (pol), Portuguese (por), Pushto (pus), Quechua (que), Romansh (roh), Romanian (ron), Russian (rus), Yakut (sah), Sicilian (scn), Scots (sco), Samogitian (sgs), Sinhala (sin), Slovak (slk), Slovenian (slv), Shona (sna), Somali (som), Southern Sotho (sot), Spanish (spa), Sardinian (srd), Serbian (srp), Sundanese (sun), 
Swahili (swa), Swedish (swe), Tamil (tam), Tatar (tat), Telugu (tel), Tajik (tgk), Tagalog (tgl), Thai (tha), Tsonga (tso), Turkmen (tuk), Turkish (tur), Uighur (uig), Ukrainian (ukr), Urdu (urd), Northern Uzbek (uzn), Venetian (vec), Vietnamese (vie), Vlaams (vls), Volapük (vol), Walloon (wln), Wu Chinese (wuu), Xhosa (xho), Mingrelian (xmf), Yiddish (yid), Zeeuws (zea), Standard Malay (zsm), Zulu (zul).