The SpeeD-IA Project

Speech Datasets and Models for Indo-Aryan Languages

Supported by Karya Inc.

Objective of the Project

The main objective of the project is to build a speech dataset of at least 800 hours, consisting of around 200 hours in each of four Indian languages from the Indo-Aryan language family: Awadhi, Bhojpuri, Braj and Magahi. The project will also prepare phone sets, language models and baseline speech recognition models for these languages.

The Project is being implemented by

UnReaL-TecE LLP (Project Leader)
Council for Strategic and Defense Research, New Delhi
Indian Institute of Technology, Kharagpur
Karya Inc., Gurugram

The Pilot Project

The pilot project resulted in the collection of approximately 18 hours of data in all four languages. The data is available here.

Phase I

In the first phase of the project, we have started collecting and transcribing 50 hours of speech data in each language.

Questions?

Contact [unrealtece@gmail.com] for more information on the project.
