Duration of the Project: April 2022 - March 2024
Deliverables
In this initial phase, the main objective of the project is to build a speech dataset of at least 1,200 hours, consisting of around 200 hours each in 6 Indian languages from the Tibeto-Burman language family. The project will also prepare phone sets, language models and baseline models for speech recognition in these languages. In the next phases of the project, we plan to expand to more languages, including those from the Austro-Asiatic and Tibeto-Burman language families. The aims, objectives and complete deliverables of the project are listed below -
To build a transcribed speech dataset of approximately 200 hours each in 6 Tibeto-Burman languages - Bodo (mainly spoken in Assam), Meetei (mainly spoken in Manipur), Chokri (mainly spoken in Nagaland), Kokborok (mainly spoken in Tripura), Nyishi (mainly spoken in Arunachal Pradesh) and Toto (mainly spoken in West Bengal).
To develop a phone set for each of the languages under study.
To build a language model for each of the languages under consideration.
To build a baseline ASR system for each of the above languages.
To make the dataset and the pre-trained and fine-tuned models publicly available through Bhashini / ULCA as well as other platforms, including GitHub and other appropriate repositories and servers, under a CC BY 4.0 license (for the dataset) and AGPL v3 (for the models).
Methodology
The project aims to collect and transcribe speech data from 6 Tibeto-Burman Indian languages spoken in the North-Eastern and Eastern parts of India - Bodo (mainly spoken in Assam), Meetei (mainly spoken in Manipur), Chokri (mainly spoken in Nagaland), Kokborok (mainly spoken in Tripura), Nyishi (mainly spoken in Arunachal Pradesh) and Toto (mainly spoken in West Bengal). The data for each language will be collected from approximately 80 - 100 speakers, with a roughly equal proportion of male and female speakers, mostly from the age group of 20 - 50 years. While we will focus on collecting data from the specific domains of education, agriculture, and science and technology, in order to ensure wide coverage of the dataset we will also collect a reasonable amount of data (at least one-third of the total dataset) from other domains such as governance and policy, medicine, tourism, politics, religion, culture, food, sports and entertainment.
Of the total raw data, around 100 hours in each language will be manually transcribed in the official or most widely used script of the respective language (or in IPA), time-aligned at the individual utterance / sentence level. The rest of the data will be incrementally transcribed using automated methods with a human in the loop, where the human validates the automated transcriptions. The dataset will be collected and transcribed using field methods of linguistic data collection, supported by a mobile app custom-built for speech data collection by one of our collaborators, Karya Inc. More specifically, the data collection will involve traditional field methods and language documentation techniques such as elicitation- and translation-based methods, narrations and descriptions, role-play, and sociolinguistic (including ethnographic and Labovian) methods of observation and collection (especially for conversational data). These methods will be combined with the conceptualisation of different activities as microtasks to be completed by the language speakers (enabled through Karya). Together, these will ensure relatively quick collection of data, wider coverage of the dataset, and usability of the dataset for purposes beyond technology development, such as linguistic studies and language teaching.
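As an illustration, the time-aligned, utterance-level transcriptions could be stored in a simple JSON Lines manifest. The sketch below is only an assumption about a possible format (the project document does not commit to one); the field names, the ISO 639-3 code and the file paths are illustrative placeholders.

# A minimal sketch (assumed format, not the project's specified one) of how
# utterance-level, time-aligned transcriptions could be stored as JSON Lines
# and lightly validated before release.
import json
from dataclasses import dataclass, asdict

@dataclass
class Utterance:
    audio_path: str    # path to the recording, e.g. "bodo/spk012/session03.wav" (placeholder)
    language: str      # ISO 639-3 code, e.g. "brx" for Bodo
    speaker_id: str    # anonymised speaker identifier
    gender: str        # "F" / "M"
    age_group: str     # e.g. "20-50"
    domain: str        # e.g. "agriculture", "education"
    start_time: float  # utterance start within the recording, in seconds
    end_time: float    # utterance end within the recording, in seconds
    text: str          # manual or human-validated transcription

def write_manifest(utterances, path):
    """Write one JSON object per line (a JSON Lines manifest)."""
    with open(path, "w", encoding="utf-8") as f:
        for utt in utterances:
            assert utt.end_time > utt.start_time, "invalid time alignment"
            f.write(json.dumps(asdict(utt), ensure_ascii=False) + "\n")

# Hypothetical example entry; the path, speaker ID and text are placeholders.
example = Utterance(
    audio_path="bodo/spk012/session03.wav",
    language="brx",
    speaker_id="spk012",
    gender="F",
    age_group="20-50",
    domain="agriculture",
    start_time=12.48,
    end_time=17.02,
    text="...",
)
write_manifest([example], "bodo_manifest.jsonl")

Keeping one utterance per line with explicit start and end times makes it straightforward to derive training segments for ASR while retaining the speaker and domain metadata needed for linguistic studies.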
The raw speech dataset and transcriptions will be used to train the first exclusive multilingual ASR model for these languages. Since the languages represented in the model are internally diverse (and at least one of them - Hindi or English or both - belongs to a different language family), a multilingual model should generalise better than a monolingual one. Moreover, the Tibeto-Burman language family includes hundreds of other low-resource languages, and it is very difficult to collect large amounts of data in any one of them (because of sparse speaker populations and their minimal presence on open social media); a multilingual approach therefore allows us to build a reasonably good model for all of these languages, which would not be possible by building models for only one or two of them. We believe that both the dataset and the models will enable building speech and language technologies for the other extremely under-resourced languages in the family with far fewer resources, using methods of transfer learning and multilingual processing.
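As a rough illustration of how such a multilingual ASR model could be bootstrapped, the sketch below fine-tunes a multilingual self-supervised checkpoint with a CTC output layer sized to a pooled vocabulary covering all six languages. The checkpoint name (facebook/wav2vec2-xls-r-300m on the Hugging Face Hub), the vocab.json file and the hyperparameters are assumptions made for illustration, not the project's committed toolchain.

# A minimal sketch, assuming a wav2vec 2.0-style multilingual checkpoint is used
# as the starting point; model name, vocabulary file and settings are illustrative.
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# Tokenizer built from the pooled phone / character set of all six languages
# (vocab.json is a hypothetical file derived from the project's phone sets).
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the multilingual pretrained encoder and attach a fresh CTC head sized
# to the combined vocabulary of the six Tibeto-Burman languages.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # keep the low-level acoustic feature extractor fixed
# Fine-tuning then proceeds with a standard CTC training loop over the pooled
# multilingual manifest, followed by language model rescoring at decoding time.

Sharing a single CTC head over a pooled phone set is one way to let the related low-resource languages benefit from each other's data; language-specific output heads or adapter layers are possible alternatives.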