TTS Training Data


Click here to download the data:

The creation of the dataset was supported by the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) on behalf of the German Ministry for Economic Cooperation and Development.

Dataset description:

  • Languages: Hindi, Marathi, Telugu (one male and one female voice artist per language)

  • Recording specifications: Studio-quality recordings at a 48 kHz sampling rate, 24 bits per sample

  • Sentences are mined from online sources as well as printed textbooks.

  • Some sentences are common to the male and female voice artists within a language.

  • Domains covered:

          • Agriculture

          • Health

          • Finance

          • Weather

          • Social science

          • Education

          • Politics

          • General, and

          • Miscellaneous

  • The training data text retains all special characters, but only a few are allowed in the evaluation set. The characters expected in the evaluation set are listed in the challenge GitHub repository.

  • The recordings in this dataset, while primarily matched to their respective text, may contain the following errors:

          • The audio is fine, but the text contains typographical errors.

          • Audio and text match, but there are problems in the audio, such as mispronounced words, the speaker speaking too fast, speech that is difficult to understand, long pauses between words, etc.

          • Audio and text may match, but the sentence might be incomplete or meaningless.

          • The audio matches the text exactly but has distortion during the speech.

          • The audio matches the text exactly but has distortion only in the leading and/or trailing silence.

          • The sentence is in one language but written in the script of another (for example, a Hindi sentence written in Telugu script).
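
The script-mismatch error in the list above is one of the few that can be screened for from the text alone. The following is a minimal sketch of such a check, not part of the dataset tooling: it assumes transcripts are available as Unicode strings, and the language keys (`hindi`, `marathi`, `telugu`) and function names are hypothetical. It relies only on the Unicode block ranges for Devanagari (U+0900 to U+097F) and Telugu (U+0C00 to U+0C7F).

```python
# Unicode block ranges for the scripts used by the three languages.
# Hindi and Marathi are both written in Devanagari.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "telugu": (0x0C00, 0x0C7F),
}

# The danda and double danda sit in the Devanagari block but are used as
# punctuation across several Indic scripts, so they are treated as neutral.
SHARED_PUNCTUATION = {0x0964, 0x0965}

# Hypothetical language keys; the dataset's own naming may differ.
EXPECTED_SCRIPT = {
    "hindi": "devanagari",
    "marathi": "devanagari",
    "telugu": "telugu",
}


def script_of(ch):
    """Return the tracked script a character belongs to, or None."""
    cp = ord(ch)
    if cp in SHARED_PUNCTUATION:
        return None
    for name, (low, high) in SCRIPT_RANGES.items():
        if low <= cp <= high:
            return name
    return None


def has_script_mismatch(text, language):
    """Return True if text contains characters from a tracked script
    other than the one expected for the given language."""
    expected = EXPECTED_SCRIPT[language]
    return any(
        script is not None and script != expected
        for script in map(script_of, text)
    )


if __name__ == "__main__":
    print(has_script_mismatch("नमस्ते", "hindi"))      # Devanagari text, Hindi label
    print(has_script_mismatch("నమస్కారం", "hindi"))   # Telugu script, Hindi label
```

Because Hindi and Marathi share the Devanagari script, a check like this cannot tell those two languages apart; it only flags text written in the other tracked script. Latin letters, digits, and punctuation are ignored.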

  • The table below summarizes the details of the corpus used in this challenge.

Domain-wise distribution