Dataset

The data collected in Bengali and Bhojpuri as part of the RESPIN project will be shared with the participants of this challenge. RESPIN is an initiative that aims to collect dialect-rich read-speech corpora in 9 Indian languages. For this special session, we will share around 850 hours of read-speech data, spoken by nearly 2000 speakers for each of the 2 languages. A development set in each language will be made available along with the training set so that participants can evaluate their intermediate models. A blind test set will be released later. Both the dev and test sets contain only unseen speakers and sentences, balanced across dialects. In addition, dialect-rich text corpora composed by language experts of the corresponding dialects will be made available. Baseline training recipes and model weights will also be made public.


The audio files can be accessed at https://ee.iisc.ac.in/madasr23dataset/

The transcripts and the relevant preprocessing code are available at the challenge GitHub repository: https://github.com/bloodraven66/RESPIN_ASRU_Challenge_2023/

Submit your results at https://forms.gle/RQct8t3s5RCHxtg46

Data Split description