The main objective of the project is to build a speech dataset of at least 800 hours, consisting of around 200 hours in each of four Indian languages from the Indo-Aryan language family: Awadhi, Bhojpuri, Braj and Magahi. The project will also prepare phone sets, language models and baseline speech recognition models for these languages.
The project is being implemented by:
UnReaL-TecE LLP (Project Leader)
Council for Strategic and Defense Research, New Delhi
Indian Institute of Technology, Kharagpur
Karya Inc., Gurugram
The pilot project has resulted in the collection of approximately 18 hours of data in all four languages. The data is available here.
In the first phase of the project, we have begun collecting and transcribing 50 hours of speech data in each language.
Contact unrealtece@gmail.com for more information on the project.