Linguistics and Computational Linguistics Research @CTRANS
The links and details on this page are currently being updated!
This page lists the Linguistics and Computational Linguistics research carried out by Dr. Ritesh Kumar and his group at the K.M. Institute of Hindi and Linguistics and the Centre. The resources and technologies we have developed are listed below.
Research on Minoritised and Endangered Languages and Varieties of Language
For most of the languages mentioned on the left, there is a high probability that you have not heard their names, let alone be aware of language descriptions, resources or technologies for them. One of our primary aims is to produce at least basic resources and language processing technologies for all of the hundreds or even thousands of Indian and South Asian languages. Here we list the very humble start that we have made in that direction; we hope to accelerate this work and give it a major push in the coming years.
Bundeli
Text corpus of Bundeli. It is in the final stages of proofreading and annotation and we hope to release it soon.
Eastern Hindi Variety [Speech]
Speech corpus of Eastern Hindi Variety (from Bihar). The corpus is currently being processed and we should be able to release it soon.
Hate Speech and Aggressive Language Research
It is probably because research on aggressive language began very early at our University that Hindi is today a rather resource-rich language as far as aggressive and hate speech research is concerned. Of course, in the last couple of years, there have been more resource and technology development efforts in the field for Hindi, but our University was undoubtedly a pioneer in the field. We strive to continue this tradition as we have now started exploring hateful and aggressive speech in other major languages of India, beyond Hindi.
The Aggression Project
Supported by the UK-India Education and Research Initiative and carried out in collaboration with multiple institutions from India and the UK, the project led to the development of the first speech dataset of over 50 hours in Hindi and English annotated for aggression, as well as a tool to automatically recognise aggression in Hindi and English speech (a demo of the tool is available here - http://panlingua.co.in/art/ and the dataset will be released soon here - https://github.com/kmi-linguistics/speech-aggression).
Detection of Aggressive Behaviour on Social Media
Supported by Microsoft Research, this project resulted in the first dataset from Twitter and Facebook in Hindi and English as well as models for automatic detection of aggression in social media text. Visit the GitHub page for the dataset.
The ComMA Project
Supported by Facebook Research, this ongoing project aims to build multilingual, multimodal datasets and systems for recognising misogyny, communalism and other forms of aggression in Indian languages such as Bangla, Hindi, Meitei, English and others. Please visit the project website for more details.
Applications / Competitions / Shared Tasks
Besides the two major areas of research mentioned above, we contribute to different areas of Linguistics via course projects, dissertations, etc. as well as other kinds of research. We list some of our significant contributions here, especially those which are built as assistive technologies for researchers working in language documentation and revitalisation of endangered, minoritised and lesser-known Indian languages. We also list some of the systems that we submitted for shared tasks / public competitions.
mScrabble
mScrabble is a multilingual mobile and web-based version of the popular language game Scrabble, aimed especially at the endangered and lesser-known languages of India. More importantly, it allows mScrabble games to be generated for different languages using only a dictionary of words in the language concerned and a list of characters in its script. The app is currently available for Koda and Mahali (two critically endangered Austro-Asiatic languages of India), Magahi, Bhojpuri, Awadhi and Braj Bhasha, besides Hindi, Bangla and English, and more languages are continuously being added.
Bahubhashi (Multilingual)
Bahubhashi is a multilingual language assistant that can perform various language-related tasks such as spell and grammar checking, word prediction, etc. for Indian languages. It is currently under active development and a beta version will soon be released for Hindi and a few lesser-known and endangered languages such as Magahi, Bhojpuri, Koda and Mahali.
Linguistic Field Data Management and Analysis System [LiFe]
LiFe is a project to build an app for managing linguistic field data, especially within the context of language documentation. The app assists in analysing the data, publishing and exporting it in multiple formats, and storing it in a way that allows it to be leveraged for developing NLP applications. It is currently under development and we hope to release an alpha version for testing within the next couple of months. The app is mainly targeted at researchers working in the fields of language documentation and revitalisation.