Dataset

The CoLI-Kenglish dataset consists of English and Kannada words in Roman script, grouped into six major categories: “Kannada”, “English”, “Mixed-language”, “Name”, “Location” and “Other”. Participants are invited to submit their methods to the Kanglish shared task, in which each word is to be identified and assigned to one of these categories.

Table 1 describes the labels in the CoLI-Kenglish dataset.

Train set: Download

Test set (without label): Download

Test set (with label): Download

We are also sharing some raw code-mixed texts in the Kannada-English language pair, which you may want to use for training word embeddings or language models as part of your methodology.

Raw texts: Download

Dataset Statistics

Table 2 presents the label distribution in the train set. The statistics for the test set will be released later.
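
If you want to check the label distribution yourself, the following is a minimal sketch. It assumes the train set can be read as one "word<TAB>label" pair per line; that file layout is an assumption and is not stated on this page, so adapt the parsing to the actual download. The function name `label_distribution` and the placeholder tokens are hypothetical and used only for illustration.

```python
from collections import Counter

# The six category names listed in the dataset description.
LABELS = ["Kannada", "English", "Mixed-language", "Name", "Location", "Other"]


def label_distribution(lines):
    """Count labels in lines of the ASSUMED form 'word<TAB>label'."""
    # Start every known label at zero so absent labels still appear.
    counts = Counter({label: 0 for label in LABELS})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _word, label = line.rsplit("\t", 1)
        counts[label] += 1
    return counts


# Placeholder rows for illustration only; they are not taken from the dataset.
sample = [
    "tok_a\tKannada",
    "tok_b\tEnglish",
    "tok_c\tMixed-language",
    "tok_d\tEnglish",
]

dist = label_distribution(sample)
total = sum(dist.values())
for label in LABELS:
    print(f"{label}: {dist[label]} ({dist[label] / total:.1%})")
```

To run this on the real train set, pass an open file object instead of `sample`, once you have confirmed the actual column layout.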
