Dataset

We use data from the RESPIN project at SPIRE Lab, Indian Institute of Science, Bengaluru, India. As part of the hackathon, we are working with around 1100 hours of recorded speech in each of two languages, Bhojpuri and Bengali. The data cover 3 dialects of Bhojpuri and 5 dialects of Bengali. The Bengali corpus has equal amounts of data from its 5 dialects: Varendri, Standard Colloquial Bengali, Western Bengali, Jharkhandi, and Rajbanshi. The Bhojpuri corpus likewise has equal amounts of data from the Northern, Southern, and Western Bhojpuri dialects. The dataset is a read-speech corpus, with sentences prepared in the agriculture and finance domains. It also includes extra dialect-specific sentences for language modelling. Running the Google Cloud Platform (GCP) speech recognition API on the dataset gives a word error rate (WER) of around 40% in Bengali (using the Bengali GCP ASR API) and around 49% in Bhojpuri (using the Hindi GCP ASR API). More details about the data are tabulated below.
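For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the recognised text and the reference transcript, divided by the number of reference words. The following is a minimal per-utterance sketch in Python; the function name `word_error_rate` is illustrative, and whitespace tokenisation with no text normalisation is an assumption, since the scoring setup behind the figures above is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: whitespace tokenisation, no text normalisation.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a five-word reference.
print(word_error_rate("a b c d e", "a x c d"))
```

A corpus-level WER is usually obtained by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance rates.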