SpeechCorpus

CSRL Acoustic-Phonetic Continuous Speech Corpus

The Cognitive Systems Research Laboratory (CSRL) corpus of read speech has been
designed to provide speech data for the acquisition of acoustic-phonetic
knowledge and for the development and evaluation of automatic speech
recognition systems.

Text corpus design, speech recording, and segmentation of the data into phoneme
units were done at CSRlab between May 1999 and March 2000.

The CSRL Speech Database contains a total of 2000 sentences: 1000 English and
1000 Hindi sentences, recorded through a variety of telephone instruments
(telephonic recording).

Care was taken to select speakers from different walks of life. The speakers
were from different age groups, spoke different languages, were from different
parts of India, had different educational levels, and had different degrees of
fluency in the languages in which the recordings were made. Each of the 100
speakers, drawn from 11 major linguistic regions of India, spoke a total of 20
sentences (10 English and 10 Hindi).

The table below shows the number of speakers for each of the 11 linguistic
regions. A speaker's linguistic region is the geographical area of India where
a single major language is predominantly spoken, and where the speaker or
their forefathers lived, speaking that language. These linguistic regions
roughly correspond to the states in the Indian Union.

Linguistic
Region (lr)    No. of Speakers
-----------    ---------------
     1               23
     2                7
     3               32
     4                5
     5                8
     6                2
     7                6
     8                5
     9                4
    10                5
    11                3
-----------    ---------------
Total               100
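
The figures quoted in this description can be cross-checked against one
another. The following is a minimal Python sketch of that check, assuming only
the numbers given above (the per-region speaker counts and the 10 English plus
10 Hindi sentences per speaker). The names speakers_per_region and
corpus_totals are illustrative and are not part of the corpus distribution.

```python
# Speakers per linguistic region (lr), copied from the table above.
speakers_per_region = {
    1: 23, 2: 7, 3: 32, 4: 5, 5: 8, 6: 2,
    7: 6, 8: 5, 9: 4, 10: 5, 11: 3,
}

# Sentences read by each speaker, as stated in the description.
ENGLISH_PER_SPEAKER = 10
HINDI_PER_SPEAKER = 10


def corpus_totals(region_counts):
    """Derive corpus-level totals from the per-region speaker counts."""
    speakers = sum(region_counts.values())
    return {
        "regions": len(region_counts),
        "speakers": speakers,
        "english_sentences": speakers * ENGLISH_PER_SPEAKER,
        "hindi_sentences": speakers * HINDI_PER_SPEAKER,
        "sentences": speakers * (ENGLISH_PER_SPEAKER + HINDI_PER_SPEAKER),
    }


totals = corpus_totals(speakers_per_region)
for name, value in totals.items():
    print(f"{name}: {value}")
```

Under these assumptions the derived totals agree with the stated ones: 11
regions, 100 speakers, 1000 English and 1000 Hindi sentences, 2000 sentences
in all.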