Dataset

The CoLI-Tunglish dataset consists of Tulu, Kannada, and English words in Roman script, grouped into seven major categories: "Tulu", "Kannada", "English", "Mixed-language", "Name", "Location", and "Other". Each word in the Test set must be assigned one of these seven categories.
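As a rough illustration of the labelling scheme, the sketch below checks that every predicted label is one of the seven categories. The function name `invalid_predictions` and the sample tokens are hypothetical, and the exact label spellings used in the released files are an assumption taken from the category names above.

```python
# The seven categories named in the dataset description.
# Assumption: the released files spell the labels exactly like this.
LABELS = ("Tulu", "Kannada", "English", "Mixed-language", "Name", "Location", "Other")


def invalid_predictions(predictions):
    """Return (index, word, label) for each prediction whose label is not a known category."""
    return [
        (i, word, label)
        for i, (word, label) in enumerate(predictions)
        if label not in LABELS
    ]


# Hypothetical predictions with placeholder tokens, not taken from the dataset.
sample = [("word1", "Tulu"), ("word2", "english"), ("word3", "Location")]
print(invalid_predictions(sample))  # [(1, 'word2', 'english')]
```

The comparison is case-sensitive, so a label such as "english" is reported as invalid; this is one way to catch spelling mismatches before submitting predictions.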

Table 1 presents the description of the labels in the CoLI-Tunglish dataset.

Table 1: Description and samples of tokens in the code-mixed Tulu dataset for language identification (LI)

Dataset Statistics

Table 2 presents the label distribution in the training set. The statistics for the test set will be released later.

Train set: Download

Development set: Download

Test set (without label): Download

Test set (with label): Download
