CoLI-Dravidian Shared Task@FIRE2024
Language Identification (LI) is the task of detecting the language(s) used in a given text. It is a preliminary step for many applications, such as sentiment analysis, machine translation, information retrieval, and natural language understanding. In multilingual India, social media text, especially among the youth, is often code-mixed, blending local languages with English at various levels. Such text poses significant challenges for LI, particularly when languages are mixed within a single word. Dravidian languages, widely spoken in southern India, are under-resourced despite their rich morphological structure. They also face technological challenges, especially in script representation on digital platforms, which leads users to prefer Roman or hybrid scripts for communication. This prevalent code-mixing offers vast linguistic data for research, yet it remains understudied.
To address word-level LI challenges in Dravidian languages, we are organizing a shared task and providing datasets for four languages (Kannada, Tamil, Malayalam, and Tulu) to encourage the development of advanced LI models.
For registration, please visit this link