Scope and Objective
As the largest continental area on Earth, Eurasia is home to many language families and subfamilies, e.g. Afro-Asiatic (Semitic), Austroasiatic, Caucasian, Celtic, Chukchi-Kamchatkan, Dravidian, Eskimo–Aleut, Indo-European, Japonic, Koreanic, Mongolic, Nivkh, Sino-Tibetan, Tai-Kadai, Turkic, Tungusic, Uralic and Yeniseian. More than 2,018 languages are spoken there. Despite this rich diversity, many language communities in Eurasia are under-represented, minoritised, endangered and systematically oppressed politically. As a result, many of these languages, such as Kurdish, Gilaki, Santali, Kashmiri, Laz and Abkhaz, are low-resource, and many are endangered and have been studied very little, as is the case with, for example, Shabaki, Talysh, Domari, Korbet and Bawm.
One interesting characteristic of these languages is the influence of communal languages on their lexicon through borrowed words or cognates. To some extent, this influence may also be observed in the syntax of the less-resourced language(s) in question, even though the influencing language is often typologically different and belongs to a different language family. Because they rely on a lingua franca, many of these linguistic communities face standardization issues, particularly in their written forms. As a result, speakers of an under-represented language often write it in the script of another language.
This workshop will focus on the development of language technology resources and tools for indigenous, endangered and lesser-resourced languages on the Eurasian continent.
In a media-centric world where language technology allows people to break cultural and language barriers, it is important that speakers of endangered and indigenous languages are empowered to use this technology to continue sharing their knowledge and culture with the world. With the hope of bridging this gap, the goal of this workshop is to increase visibility and promote research for lesser-resourced and under-represented languages in Europe and Asia. Through collaboration between NLP researchers, language experts and linguists working for the benefit of endangered languages in these communities, we aim to create language technology resources that will help to preserve and revive these languages for future generations. Furthermore, the workshop aims to promote the emergence of new methods that benefit three groups: linguists, for instance through the automation of analysis and validation processes; field linguists, through the facilitation of data collection and analysis; and computational linguists, through new techniques for linguistic analysis, such as supervised or weakly supervised methods for analysing scarcely written or undocumented languages.
The main objective of the workshop is to create basic resources and develop tools for Eurasian languages. We invite contributions on language technology for indigenous, endangered and less-resourced languages of Eurasia, including but not limited to the following topics:
identifying languages and variants spoken in these regions
creation of language resources and applications, e.g. sentiment analysis, named entity recognition, and syntactic parsing
standardization for endangered languages
automatic identification and classification of lexical variation and language varieties
adaptation of fundamental NLP tools for these languages, e.g. morphological analysers, taggers and parsers
reusability of language resources in NLP applications, e.g. machine translation and POS tagging
machine translation between closely related languages
evaluation of language resources and tools when applied to lesser-resourced languages in the same language families
corpora, resources, and tools for closely related languages
linguistic and textual similarities among languages in Eurasia
digitalization of endangered languages
challenges in the creation of language resources and tools from linguistic perspectives (including any perspective or formal theory)
Identify, Describe, and Share your LRs!
Describing your language resources (LRs) in the LRE Map is now a normal practice in the submission procedure of LREC (introduced in 2010 and adopted by other conferences). To continue the efforts initiated at LREC 2014 on “Sharing LRs” (data, tools, web-services, etc.), authors will have the possibility, when submitting a paper, to upload LRs to a special LREC repository. This effort of sharing LRs, linked to the LRE Map for their description, may become a new “regular” feature for conferences in our field, thus contributing to a common repository where everyone can deposit and share data.
As scientific work requires accurate citations of referenced work so as to allow the community to understand the whole context and also replicate the experiments conducted by other researchers, ELRA encourages all LREC-COLING authors to endorse the need to uniquely identify LRs through the use of the International Standard Language Resource Number (ISLRN, www.islrn.org), a Persistent Unique Identifier to be assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC-COLING 2024 papers will be offered at submission time.