SpeechCorpus
CSRL Acoustic-Phonetic Continuous Speech Corpus
The Cognitive Systems Research Laboratory (CSRL) corpus of read speech has been designed to provide speech data for
the acquisition of acoustic-phonetic knowledge and for the development and
evaluation of automatic speech recognition systems.
Text corpus design, speech recording, and data segmentation into phoneme units
were done in CSRlab between May 1999 and March 2000.
The CSRL Speech Database contains a total of 2000 sentences: 1000 English and
1000 Hindi sentences, recorded through a variety of telephone instruments
(telephonic recording).
Care was taken to select speakers from different walks of life. The speakers
came from different age groups and different parts of India, spoke different
languages, had different educational levels, and had different degrees of
fluency in the languages in which the recordings were made. Each of the 100
speakers, drawn from 11 major linguistic regions of India, spoke a total of
20 sentences (10 English and 10 Hindi).
The table below shows the number of speakers in each of the 11 linguistic regions.
A speaker's linguistic region is the geographical area of India where a single
major language is predominantly spoken and where the speaker or the speaker's
forefathers lived, speaking that language. These linguistic regions roughly
correspond to the states of the Indian Union.
Linguistic
Region (lr)    No. of Speakers
-----------    ---------------
 1             23
 2              7
 3             32
 4              5
 5              8
 6              2
 7              6
 8              5
 9              4
10              5
11              3
-----------    ---------------
Total          100
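
The figures quoted in this file can be cross-checked with a few lines of
Python. The sketch below is illustrative only and is not part of the corpus
distribution: the names SPEAKERS_PER_REGION, SENTENCES_PER_SPEAKER and
corpus_totals are hypothetical, and the counts are copied by hand from the
table and text above.

```python
# Speaker counts per linguistic region (lr), copied from the table in this file.
SPEAKERS_PER_REGION = {
    1: 23, 2: 7, 3: 32, 4: 5, 5: 8, 6: 2,
    7: 6, 8: 5, 9: 4, 10: 5, 11: 3,
}

# Each speaker read 10 English and 10 Hindi sentences.
SENTENCES_PER_SPEAKER = {"English": 10, "Hindi": 10}


def corpus_totals(speakers_per_region, sentences_per_speaker):
    """Return (speaker count, sentences per language, total sentences)."""
    n_speakers = sum(speakers_per_region.values())
    per_language = {
        language: count * n_speakers
        for language, count in sentences_per_speaker.items()
    }
    return n_speakers, per_language, sum(per_language.values())


n_speakers, per_language, n_sentences = corpus_totals(
    SPEAKERS_PER_REGION, SENTENCES_PER_SPEAKER
)
print(n_speakers, per_language, n_sentences)
```

If the table and the text agree, the speaker count comes to 100, each language
to 1000 sentences, and the corpus to 2000 sentences in all.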