The CoLI-Tunglish dataset consists of Tulu, Kannada, and English words in Roman script, grouped into seven major categories: “Tulu”, “Kannada”, “English”, “Mixed-language”, “Name”, “Location”, and “Other”. Each word in the test set must be assigned exactly one of these seven categories.
Table 1 describes the labels in the CoLI-Tunglish dataset.
Table 1: Description and samples of tokens in the code-mixed Tulu dataset for LI
Table 2 presents the label distribution in the training set. The statistics for the test set will be released later.
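As a minimal sketch of how the label set and a label distribution like the one in Table 2 might be handled in code, the Python snippet below checks that every word carries one of the seven categories and counts how often each occurs. The function names, the in-memory `(token, label)` pair format, and the placeholder tokens are assumptions made for illustration; they are not part of the dataset release, and the counts shown are not dataset statistics.

```python
from collections import Counter

# The seven categories named in the task description.
LABELS = ("Tulu", "Kannada", "English", "Mixed-language", "Name", "Location", "Other")


def validate_labels(pairs):
    """Raise ValueError if any (token, label) pair uses a label outside LABELS."""
    for token, label in pairs:
        if label not in LABELS:
            raise ValueError(f"unknown label {label!r} for token {token!r}")


def label_distribution(pairs):
    """Return a dict mapping every label in LABELS to its count (0 if absent)."""
    counts = Counter(label for _, label in pairs)
    return {label: counts.get(label, 0) for label in LABELS}


# Placeholder tokens for illustration only; they are not taken from the dataset.
sample = [("tok1", "Tulu"), ("tok2", "English"), ("tok3", "Tulu"), ("tok4", "Name")]

validate_labels(sample)
dist = label_distribution(sample)
print(dist)
```

Reporting a zero for absent labels keeps the output aligned with the full set of seven categories, which makes it easier to compare distributions across splits.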