Abstract
Language Identification (LI) in text processing has traditionally focused on automatically detecting the languages within a document, predominantly for high-resource languages such as English, Spanish, German, and French. However, recent technological advancements have highlighted the significance of addressing LI challenges in multilingual countries like India, where people blend their mother tongue or a local/regional language with English, creating code-mixed comments. Indian languages in general, and Dravidian languages in particular, are marginalized in LI because of a lack of resources. To address this issue, we intend to organize a tutorial on "Word-level Language Identification in Code-mixed Dravidian Languages" that covers word-level LI in code-mixed Dravidian language texts (Kannada, Tulu, and Tamil). The tutorial comprises a talk on word-level LI in Dravidian languages and a hands-on session with demo code.
Language Identification (LI) is the task of automatically recognizing the language of a given text. It plays a pivotal role in various Natural Language Processing (NLP) applications such as sentiment analysis, machine translation, and information retrieval. The widespread use of social media has given rise to code-mixed text, in which languages are mixed at the sub-word, word, sentence, and paragraph levels. Code-mixed text is therefore multilingual by nature, and the language must be identified at the sub-word, word, sentence, or paragraph level, depending on the application. Despite substantial research in this area, accurately identifying languages in code-mixed contexts remains an unsolved problem. Code-mixing, the seamless blending of multiple languages within a single utterance, requires distinguishing and categorizing intertwined language segments, which hampers existing LI techniques. The complexity increases when words are formed by combining root words and prefixes/suffixes from different languages, often leading to phonetic conflicts.
India, being a multilingual country, has a diverse linguistic landscape, and a considerable segment of its population, especially the youth, is adept at using both English and local languages. This linguistic amalgamation has led to the prevalence of code-mixed text, particularly on social media platforms. Notably, code-mixed text in India typically uses the Roman script for non-English languages, sometimes combined with native scripts, which creates intricate challenges for analysis and processing. Further, Indian languages in general, and Dravidian languages in particular, are under-resourced. Addressing these challenges in LI requires innovative approaches that can differentiate languages within such intricate linguistic landscapes (Balouchzahi et al., 2022a).
Word-level LI identifies the language of each individual word, one of the smallest linguistic units. It is a significant task in applications involving code-mixed text, as users tend to mix words of two or more languages (one of which is usually English). LI at the word level can be modeled as a sequence labeling problem, in which every word in a sentence is labeled with one of the languages in a predefined set. Dravidian languages such as Kannada, Tamil, Telugu, Malayalam, and Tulu are low-resource languages, and word-level LI in these languages remains largely unexplored.
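To make the sequence labeling formulation concrete, the following minimal Python sketch pairs each token of a sentence with a language tag and builds a small context window around a token. The tag set (`kn`, `en`, `mixed`, `other`), the helper names `to_sequence` and `context_window`, and the romanized example sentence are assumptions made for illustration only; they are not taken from the tutorial material or from any dataset.

```python
from typing import List, Tuple

# Hypothetical tag set for illustration; a real dataset defines its own labels.
LABELS = {"kn", "en", "mixed", "other"}


def to_sequence(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Pair every token with its language label, checking basic consistency."""
    if len(tokens) != len(labels):
        raise ValueError("each token needs exactly one label")
    unknown = set(labels) - LABELS
    if unknown:
        raise ValueError(f"labels outside the predefined set: {sorted(unknown)}")
    return list(zip(tokens, labels))


def context_window(tokens: List[str], i: int, size: int = 1) -> List[str]:
    """Return the token at position i with `size` neighbours on each side.

    Positions beyond the sentence boundary are filled with '<pad>'.
    """
    padded = ["<pad>"] * size + tokens + ["<pad>"] * size
    return padded[i:i + 2 * size + 1]


# Illustrative (made-up) romanized Kannada-English sentence.
tokens = ["naanu", "movie", "nodide"]
labels = ["kn", "en", "kn"]
sequence = to_sequence(tokens, labels)
```

A sequence labeling model would take windows such as `context_window(tokens, 1)` as input and predict the label of the centre token, so that neighbouring words can inform the decision for ambiguous words.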
The tutorial covers the following points:
• What is code-mixed text and why does code-mixing occur?
• Role of LI in handling code-mixed text
• Need for word-level LI in code-mixed Dravidian language text
• Feature extraction for word-level LI
• A wide range of models for word-level LI including Machine Learning, Deep Learning and Transfer Learning (hands-on)
By covering these points, the tutorial aims to give researchers and practitioners the knowledge needed to develop effective word-level LI solutions tailored to the linguistic characteristics of Dravidian languages.
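As an illustration of the feature extraction and machine learning points in the outline above, the sketch below extracts character n-grams from a word and feeds them to a small multinomial Naive Bayes classifier with add-one smoothing. This is only one possible baseline and is not claimed to be the approach used in the tutorial's hands-on session. The class name `NgramNaiveBayes`, the label names, and the toy romanized training words are hypothetical and not drawn from any dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=1, n_max=3):
    """Return all character n-grams of a word, using '<' and '>' as boundary marks."""
    marked = f"<{word.lower()}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (illustrative baseline)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # additive smoothing constant
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter()             # label -> number of training words
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, words, labels):
        for word, label in zip(words, labels):
            self.label_counts[label] += 1
            for gram in char_ngrams(word):
                self.ngram_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, word):
        total_words = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(count / total_words)  # log prior
            for gram in char_ngrams(word):         # smoothed log likelihoods
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: made-up romanized words, for illustration only.
train = [
    ("naanu", "kn"), ("nodide", "kn"), ("maadu", "kn"),
    ("beku", "kn"), ("illa", "kn"), ("hogu", "kn"),
    ("movie", "en"), ("super", "en"), ("the", "en"),
    ("song", "en"), ("good", "en"), ("nice", "en"),
]
model = NgramNaiveBayes().fit([w for w, _ in train], [l for _, l in train])
```

Calling `model.predict("movie")` returns the label whose n-gram statistics best match the word. Character n-grams are a common choice for romanized code-mixed text because spelling is not standardized, so sub-word patterns tend to be more stable than whole-word lookups.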
Broad Categories
• Natural Language Processing
• Under-resourced language
• Code-mixing
• Language Identification
• Machine Learning
Hosahalli Lakshmaiah Shashirekha
Email: hlsrekha@mangaloreuniversity.ac.in
Website: https://mangaloreuniversity.ac.in/shashirekha
Professor, Department of Computer Science,
Mangalore University, Mangalore - 574199, India
Asha Hegde
Email: hegdekasha@gmail.com
Website: https://sites.google.com/view/ashahegde/home
PhD student, Department of Computer Science,
Mangalore University, Mangalore - 574199, India
The team has wide experience in processing code-mixed, Dravidian, and low-resource language texts and has successfully organized a shared task on word-level LI in code-mixed Tulu at FIRE 2023 (Balouchzahi et al., 2022; Lakshmaiah et al., 2022).
H. L. Shashirekha is a Professor of Computer Science at Mangalore University, India. She specializes in NLP for low-resource and Dravidian languages and has broad experience in code-mixed text processing, especially in the Kannada language (Hegde et al., 2022, 2023). She has co-authored more than 70 scientific publications and has an h-index of 13. She is an active researcher and has organized shared tasks in various workshops, including DravidianLangTech 2022, ICON 2022, DravidianLangTech 2023, and FIRE 2023.
Asha Hegde has participated in more than 10 shared tasks and is experienced in organizing shared tasks on low-resource languages, such as Machine Translation in Dravidian languages (ACL 2022), the CoLI-Kanglish shared task at ICON 2022, Sentiment Analysis in Tamil and Tulu (DravidianLangTech@RANLP 2023), and Tulu word-level LI (CoLI-Tunglish). She has published more than 13 research articles with H. L. Shashirekha, most of them on applications for Dravidian languages (Hegde et al., 2022, 2023).
Email: kavyamujk@gmail.com
Lecturer at Department of Computer Science
Mangalore University, Mangalore - 574199, India
Email: sharalmucs@gmail.com
PhD student
Department of Computer Science
Mangalore University, Mangalore - 574199, India
Fazlourrahman Balouchzahi, Sabur Butt, A Hegde, Noman Ashraf, HL Shashirekha, Grigori Sidorov, and Alexander Gelbukh. 2022a. Overview of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022. In Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, pages 38–45.
Fazlourrahman Balouchzahi, Hosahalli Lakshmaiah Shashirekha, Grigori Sidorov, and Alexander Gelbukh. 2022b. A Comparative Study of Syllables and Character Level N-grams for Dravidian Multi-script and Code-mixed Offensive Language Identification. Journal of Intelligent & Fuzzy Systems, (Preprint):1–11.
Asha Hegde, Sharal Coelho, and Hosahalli Shashirekha. 2022. MUCS@DravidianLangTech@ACL2022: Ensemble of Logistic Regression Penalties to Identify Emotions in Tamil Text. In Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages, pages 145–150.
Asha Hegde, Hosahalli Lakshmaiah Shashirekha, Anand Kumar Madasamy, and Bharathi Raja Chakravarthi. 2023. A Study of Machine Translation Models for Kannada-Tulu. In Third Congress on Intelligent Systems: Proceedings of CIS 2022, pages 145–161. Springer.
Shashirekha Hosahalli Lakshmaiah, Fazlourrahman Balouchzahi, Mudoor Devadas Anusha, and Grigori Sidorov. 2022. CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts. Acta Polytechnica Hungarica, pages 123–141.