CoLI-Tunglish: Word-level Language Identification in Code-mixed Tulu Texts
Task Description
Language Identification (LI) is the automated process of identifying the languages used in a given text. It is often a preliminary step for applications such as sentiment analysis, machine translation, information retrieval, and natural language understanding. Word-level LI can be modeled as a sequence labeling task: each word in a sentence is assigned a language label from a predefined set of languages. Although much research has been conducted in LI, identifying languages in code-mixed text remains an unresolved challenge.
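To make the sequence labeling formulation concrete, the following minimal Python sketch shows one label being assigned to each word of a tokenized comment. Everything in it is an assumption made for illustration: the label names, the three-word toy lexicon, the fallback label, and the helper names `char_ngrams` and `tag_sentence` are hypothetical and are not taken from the dataset or from any official baseline. Character n-grams are included only because they are a common feature choice for word-level LI in Roman-script text.

```python
from typing import Dict, List

# Assumed label names, based on the categories named in the task description.
LABELS = ("Tulu", "Kannada", "English", "Mixed")

# Toy lexicon written by hand for illustration only; NOT from the dataset.
TOY_LEXICON: Dict[str, str] = {
    "tumba": "Kannada",
    "super": "English",
    "song": "English",
}


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Return character n-grams of a word, with '<' and '>' as boundary marks."""
    padded = f"<{word.lower()}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tag_sentence(tokens: List[str], lexicon: Dict[str, str],
                 fallback: str = "Tulu") -> List[str]:
    """Assign exactly one label per token using a lexicon lookup.

    The fallback label for unseen words is an arbitrary assumption; a real
    system would replace this lookup with a trained sequence labeler.
    """
    return [lexicon.get(token.lower(), fallback) for token in tokens]


tokens = ["tumba", "super", "song"]
labels = tag_sentence(tokens, TOY_LEXICON)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

The output is a label sequence of the same length as the input token sequence, which is the defining property of the task. A trained model would typically consume features such as those returned by `char_ngrams` in place of the lexicon lookup.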
Tulu is a regional language and Kannada is the official language of the Indian state of Karnataka. Tuluvas (people whose mother tongue is Tulu) usually read, write, and speak both Tulu and Kannada fluently, and many Kannada words are used in Tulu. Many Tulu speakers also know English, especially those who are active on social media platforms. Tulu songs, videos, movies, comedy programs, and skits are popular on social media, and the comments that Tulu users post on these programs are usually a code-mix of Tulu, Kannada, and English. Even though Tuluvas are proficient in reading, writing, and speaking Tulu, many of them find it difficult to use the Kannada script for messages or comments on social media because of the limitations of computer and smartphone keyboards. In addition, the complexity of forming words with consonant conjuncts makes it challenging to write Tulu text in Kannada script. As a result, many users post comments in Roman script alone or in a combination of Kannada and Roman scripts. This has generated a large amount of trilingual code-mixed data, which has rarely been explored for research purposes.
To address word-level LI in code-mixed Tulu-English (Tu-En) texts, comments on Tulu YouTube videos were collected to construct the Code-mixed Tulu-English Language Identification (CoLI-Tunglish) dataset. We encourage participants to use the CoLI-Tunglish dataset, which consists of Tulu, Kannada, English, and mixed-language words in Roman script, and to submit their methods to the CoLI-Tunglish shared task, in which each word is to be identified and categorized into one of the predefined categories.
Task Registration
To register, please fill in the participant details in this form.
Contact us
colitulufire2023@gmail.com