The SpeeD-IL Project
Speech Datasets and Models for Indian Languages
Objectives of the Project
Over the last decade or so, research in speech technologies has shifted rapidly and successfully towards exclusively data-driven techniques such as machine learning and deep learning. Experiments with well-resourced languages such as English have shown that these systems succeed when sufficient training data is available. However, barring a handful of languages, this technological revolution has bypassed most of the languages spoken in India, including the officially supported, scheduled languages. This can be gauged from how few Indian languages are commercially supported across speech-based products: Amazon Alexa supports Hindi alongside seven other international languages; Google Home supports 13 languages, with Hindi as the only Indian language; Microsoft supports Indian English, Hindi, Tamil, Telugu, Gujarati, and Marathi in its ASR systems. There is no support whatsoever for most other Indian languages, especially those belonging to the Tibeto-Burman and Austro-Asiatic language families.
One of the primary reasons behind this could be the lack of sufficient speech datasets for most Indian languages. The shortage is even more acute for the non-scheduled Indo-Aryan and Dravidian languages, and even for the scheduled languages of the Tibeto-Burman and Austro-Asiatic language families, which are largely spoken in the Eastern and North-Eastern parts of India. Bodo and Meetei, two languages of the Tibeto-Burman language family, and Santhali, a language of the Munda sub-group of the Austro-Asiatic language family, are the only languages of these families represented among the 22 scheduled languages.
To alleviate this situation, we have started the 'SpeeD-IL' (Speech Datasets and Models for Indian Languages) project to develop speech corpora, other resources, and models for underrepresented languages across different language families in India. The stated aims and objectives of the project are listed below:
To build a transcribed speech dataset of approximately 1000 hours across each of the four major language families of India - Tibeto-Burman, Austro-Asiatic, Dravidian and Indo-Aryan - and the other language families with fewer languages viz. Tai-Kadai and Great Andamanese. In each language family, at least 10 underrepresented languages will be included for data collection.
To develop a phone set for each of the languages under study.
To build baseline pre-trained models using wav2vec 2.0 (or another state-of-the-art technique) from the data collected in the project, for each language family under study.
To build a language model for the languages under consideration.
To build a baseline ASR system for each of the languages.
To make the dataset and the pre-trained and fine-tuned models publicly available through appropriate platforms, under the CC BY-NC-SA 4.0 license (for the dataset) and the AGPL v3 license (for the models).
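As an illustration of how progress towards the first objective might be tracked, the following is a minimal Python sketch. It assumes a hypothetical list of records of the form (language family, language, duration in seconds); the record layout, the function name `summarize`, and the placeholder language names are assumptions made for this example and are not part of the project's actual tooling. The 1000-hour and 10-language figures are taken from the objectives above; the 10-language minimum is applied uniformly here only for simplicity, although the smaller families may not have that many languages.

```python
from collections import defaultdict

# Targets taken from the stated objectives; everything else is illustrative.
TARGET_HOURS_PER_FAMILY = 1000
MIN_LANGUAGES_PER_FAMILY = 10


def summarize(records):
    """Summarize transcribed speech per language family.

    records: iterable of (family, language, duration_seconds) tuples
             (a hypothetical manifest layout, assumed for this sketch).
    """
    hours = defaultdict(float)
    languages = defaultdict(set)
    for family, language, seconds in records:
        hours[family] += seconds / 3600.0
        languages[family].add(language)

    return {
        family: {
            "hours": hours[family],
            "languages": len(languages[family]),
            "hours_remaining": max(0.0, TARGET_HOURS_PER_FAMILY - hours[family]),
            "languages_remaining": max(
                0, MIN_LANGUAGES_PER_FAMILY - len(languages[family])
            ),
        }
        for family in hours
    }


if __name__ == "__main__":
    # Placeholder language names; not the project's actual language list.
    demo = [
        ("Tibeto-Burman", "language_a", 7200),
        ("Tibeto-Burman", "language_b", 3600),
        ("Austro-Asiatic", "language_c", 1800),
    ]
    for family, stats in summarize(demo).items():
        print(family, stats)
```

A summary of this kind only counts collected audio; it says nothing about transcription quality or speaker balance, which would need separate checks.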