Team DISTIL

DIalectal Speech Transcription in Indian Languages

We are a team of collaborators for the SLT code hackathon, working on dialectal speech recognition in two low-resource Indian languages - Bengali and Bhojpuri.

Problem


We are building dialectal speech recognition models for two Indian languages - Bengali and Bhojpuri. There is a lack of open-source, high-quality, curated data capturing dialectal variation in Indian languages. Bengali, with at least 7 known dialectal variations [1], is spoken by around 78 million speakers [2] in Eastern India and is written in the Bangla script. While an ASR corpus was collected recently [3], it covers the standard Bengali dialect and may not be suitable for accurately recognising speech in other Bengali dialects. Bhojpuri, on the other hand, has no large ASR corpus. Although Bhojpuri is sometimes considered a variety of Hindi [2], it is spoken by nearly 50 million speakers [2]. Bhojpuri uses the same Devanagari script as Hindi; however, it differs considerably from Hindi in vocabulary, grammar, and accent. As a result, an ASR model trained primarily on Hindi data may not perform well on Bhojpuri speech.


Some open-source models exist for these languages [4], taking advantage of self-supervised pretraining and fine-tuning in Indian languages. However, the performance of such ASR models suffers because fine-tuning on standard datasets does not cater to the different dialects of a language. We propose to build ASR models that cater to various dialects of Bengali and Bhojpuri, using the data collected as part of the RESPIN project (https://respin.iisc.ac.in), Speech Recognition in Agriculture and Finance for the Poor in India, by SPIRE Lab, Indian Institute of Science, and its partner, Navana Tech, India.

Impact


The dataset is collected in Bengali and Bhojpuri to enable the development of speech-enabled applications that serve users speaking different dialects of a language, not only the standard form often spoken in urban areas. The RESPIN speech data is read speech, collected in a crowd-sourced manner from sentence prompts in the agriculture and finance domains. The prompts are prepared separately for every chosen dialect in order to capture the lexical properties of the dialects that cause mutual unintelligibility. Open-sourcing the dataset and models will allow their integration into speech-enabled applications and technological innovations across emerging markets, extending the digital transformation to a wider population, including those in rural areas.


Existing speech recognition solutions offered by various companies do not cater to most regional Indian languages. ASR models are typically built on the standard dialect(s) of a language, causing them to perform poorly in the presence of dialectal variation. Additionally, a general-purpose ASR model cannot be used directly in domains such as agriculture and finance, which have a lot of domain-specific terminology and jargon. Hence, integrating such generic ASR models limits performance in applications that must serve a larger population speaking different regional varieties of a language. The dataset collected as part of the RESPIN project covers major dialects of several regional Indian languages, enabling the development of speech recognition models for those dialects.

Project

Corpus

We are using the dataset collected as part of the RESPIN project, with around 1000 hours of labelled data each in Bengali and Bhojpuri. Running the Google Cloud Speech (GCP) recognition API on this dataset yields around 40% WER on Bengali (using the Bengali GCP ASR API) and around 49% WER on Bhojpuri (using the Hindi GCP ASR API), so there is a lot of room for improvement in developing models for the multi-dialect ASR dataset from the RESPIN project.
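For reference, the WER figures above are computed as the word-level Levenshtein (edit) distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. A minimal self-contained sketch of the metric (the function and its name are ours, not taken from any toolkit; production evaluations would also handle text normalisation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + sub_cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, wer("the cat sat", "the cat sits") is 1/3: one substitution over three reference words.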


Developing models

We are building various types of speech recognition models on these datasets to optimise performance on dialect-specific test sets. This includes monolingual and multilingual ASR models trained on the corpus, utilising self-supervised pretraining. We are working with various toolkits and releasing the recipes for training models: Conformer CTC+attention models with ESPnet [5], transducer models with NeMo [6], wav2vec / HuBERT-based models in fairseq [7] / s3prl [8] / Hugging Face, TDNN-HMM models with Kaldi [9], etc.
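To illustrate the CTC side of the Conformer CTC+attention models above: CTC emits one label per acoustic frame, and greedy decoding collapses consecutive repeats and drops the blank symbol to recover the token sequence. A minimal sketch in plain Python (the token IDs and blank index are illustrative, not from any of the toolkits listed):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse repeated frame labels and remove blanks (greedy CTC decoding)."""
    tokens = []
    prev = None
    for t in frame_ids:
        # Keep a label only if it is not blank and differs from the previous frame.
        if t != blank and t != prev:
            tokens.append(t)
        prev = t
    return tokens
```

Note that a repeated token in the output (e.g. a doubled character) is only possible when the frames are separated by a blank, which is why CTC needs the blank symbol at all: ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0]) yields [1, 1, 2].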

References


[1] Grierson, G. A., Linguistic Survey of India: Vol. 5, Indo-Aryan Family (Eastern Group), Part 2, Specimens of the Bihari and Oriya Languages. Calcutta, 1903-1927.


[2] Office of the Registrar General, India, Census of India 2011, https://censusindia.gov.in/2011Census/C16_25062018_NEW.pdf, last accessed 07 October 2021.


[3] Kjartansson, O., Sarin, S., Pipatsrisawat, K., Jansche, M., & Ha, L. (2018). Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali. Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018).


[4] Chadha, H. S., Gupta, A., Shah, P., Chhimwal, N., Dhuriya, A., Gaur, R., & Raghavan, V. (2022). Vakyansh: ASR Toolkit for Low Resource Indic languages. doi:10.48550/ARXIV.2203.16512


[5] Guo, P., Boyer, F., Chang, X., Hayashi, T., Higuchi, Y., Inaguma, H., … Zhang, Y. (2020). Recent Developments on ESPnet Toolkit Boosted by Conformer. doi:10.48550/ARXIV.2010.13956


[6] Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg, B., … Cohen, J. M. (2019). NeMo: a toolkit for building AI applications using Neural Modules. doi:10.48550/ARXIV.1909.09577


[7] Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., … Auli, M. (2019). fairseq: A Fast, Extensible Toolkit for Sequence Modeling. CoRR, abs/1904.01038.


[8] Yang, S.-W., Chi, P.-H., Chuang, Y.-S., Lai, C.-I. J., Lakhotia, K., Lin, Y. Y., … Lee, H.-Y. (2021). SUPERB: Speech Processing Universal PERformance Benchmark. Proc. Interspeech 2021, 1194–1198. doi:10.21437/Interspeech.2021-1775


[9] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., … Vesely, K. The Kaldi Speech Recognition Toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Hilton Waikoloa Village, Big Island, Hawaii, US: IEEE Signal Processing Society.