Language Identification (LI) is the task of automatically identifying the language or languages used in a given text. LI is a pre-processing step for many applications, and word-level LI can be viewed as a sequence labeling problem in which each word in a sentence is tagged either as mixed-language or as belonging to one of a predefined set of languages. Despite extensive work on LI, the problem remains far from solved in the code-mixed setting.
India has a rich heritage of languages. Kannada, a Dravidian language, is the official language of the state of Karnataka. People of Karnataka read, write and speak Kannada, but many find it difficult to use the Kannada script to post messages or comments on social media.
Technological limitations, such as the keyboards of computers and smartphones, are one reason; another may be the complexity of forming words with consonant conjuncts. Hence, most users post comments on social media in Roman script alone or in a combination of Kannada and Roman script. To address word-level LI in code-mixed Kannada-English (Kn-En) text, the Code-mixed Language Identification (CoLI-Kenglish) dataset was constructed from comments on Kannada YouTube videos.
We encourage participants to use the CoLI-Kenglish dataset, which consists of English, Kannada and mixed-language words in Roman script, and to submit their methods to the Kanglish shared task, in which each word is to be identified and assigned to one of the predefined categories.
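To make the word-level framing concrete, the following is a minimal sketch of a tagger that assigns one label to each word of a Romanized sentence. Everything in it is an assumption made for illustration: the label names (`en`, `kn`, `mixed`), the tiny word lists, the suffix list, and the functions `tag_word` and `tag_sentence` are hypothetical and are not taken from the CoLI-Kenglish dataset or from any official baseline. It is a lookup heuristic, not a trained model, and is meant only to show the expected input and output shape.

```python
from typing import List, Tuple

# Hypothetical label names; the dataset's actual tag names may differ.
LABELS = ("en", "kn", "mixed")

# Tiny illustrative word lists, NOT taken from the CoLI-Kenglish dataset.
ENGLISH_WORDS = {"video", "song", "super", "nice"}
KANNADA_WORDS = {"idu", "tumba", "chennagide"}

# A few Romanized Kannada suffixes, used to spot an English stem that
# carries a Kannada ending (assumed here to count as a mixed word).
KANNADA_SUFFIXES = ("galu", "alli", "ge")


def tag_word(word: str) -> str:
    """Return one label for a single Romanized word."""
    w = word.lower()
    if w in ENGLISH_WORDS:
        return "en"
    if w in KANNADA_WORDS:
        return "kn"
    for suffix in KANNADA_SUFFIXES:
        # English stem + Kannada suffix, e.g. "video" + "galu".
        if w.endswith(suffix) and w[: -len(suffix)] in ENGLISH_WORDS:
            return "mixed"
    # Arbitrary fallback for unknown words; a real system would use a
    # trained classifier here instead of a fixed default.
    return "kn"


def tag_sentence(sentence: str) -> List[Tuple[str, str]]:
    """Split on whitespace and tag every word, keeping the word order."""
    return [(word, tag_word(word)) for word in sentence.split()]


if __name__ == "__main__":
    for word, label in tag_sentence("idu videogalu tumba nice"):
        print(word, label)
```

A submitted system would replace the lookup in `tag_word` with a learned model, for example one based on character n-gram features, while keeping the same one-label-per-word output.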
To register, please fill in the participant details in this form.