Datasets

The shared task includes four language tracks with the following datasets:

Tulu: The Tulu dataset contains 7,171 code-mixed sentences scraped from YouTube videos, pre-processed to remove non-text characters, and transliterated into Roman script. From these sentences, 36,002 words are categorized into six classes: 'Tulu', 'Kannada', 'English', 'Mixed-language', 'Name', and 'Location'. Mixed-language words are particularly challenging because they are dynamic and vary from one writer to another.

Kannada: This dataset includes 14,847 tokens in Roman script, classified into six categories: 'Kannada', 'English', 'Mixed-language', 'Name', 'Location', and 'Other'. It aims to improve language identification and categorization methods for Kannada-English code-mixed texts.

Tamil: The Tamil dataset comprises 17,568 tokens categorized into six classes and was built with a methodology similar to that of the Tulu and Kannada datasets. It supports various natural language processing tasks within the Tamil language domain.

Malayalam: The Malayalam dataset contains 25,035 tokens, divided into seven classes: 'Malayalam', 'English', 'Mixed', 'Name', 'Number', 'Location', and 'sym' for sentence boundaries. Like the other datasets, it supports a range of NLP tasks, but it adds categories such as 'Number' for numerical values.
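The class inventories described above can be sketched as a small lookup table. This is only an illustrative sketch: the names `LABELS` and `label_counts` are hypothetical, the sample tokens are placeholders rather than real dataset entries, and Tamil is left out because its six class names are not listed here. The label strings are assumed to match the quoted class names exactly, which the released files may not do.

```python
from collections import Counter

# Class inventories as listed in the track descriptions above.
# Tamil is omitted because its six class names are not listed on this page.
LABELS = {
    "Tulu": ["Tulu", "Kannada", "English", "Mixed-language", "Name", "Location"],
    "Kannada": ["Kannada", "English", "Mixed-language", "Name", "Location", "Other"],
    "Malayalam": ["Malayalam", "English", "Mixed", "Name", "Number", "Location", "sym"],
}


def label_counts(track, tagged_tokens):
    """Count labels for one track, rejecting labels outside its inventory."""
    allowed = set(LABELS[track])
    counts = Counter()
    for token, label in tagged_tokens:
        if label not in allowed:
            raise ValueError(f"{label!r} is not a {track} class (token {token!r})")
        counts[label] += 1
    return counts


# Placeholder tokens for illustration only, not drawn from the datasets.
sample = [("token_a", "Kannada"), ("token_b", "English"), ("token_c", "Kannada")]
print(label_counts("Kannada", sample))
```

A check like this is mainly useful for catching labels that belong to a different track, for example 'Other' appearing in Tulu data.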

The following table presents sample tokens from the Kannada dataset.
