Data selection for naturalness in HMM-based speech synthesis

Author(s): Erica Cooper, Yocheved Levitan and Julia Hirschberg


We describe experiments in training HMM text-to-speech voices on professional broadcast news data from multiple speakers. We compare data selection techniques designed to identify the best utterances for voice training in a corpus not explicitly recorded for synthesis, with the aim of selecting the utterances that will produce the most natural-sounding voices. We also explore different methods for voice training and utterance synthesis that can improve naturalness. The ultimate goal of this work is to rapidly develop intelligible and natural-sounding synthetic voices in low-resource languages, without the expense of collecting and annotating professional data specifically for text-to-speech; we focus on English first in order to develop our methods. We also describe the results of crowdsourced listening tests, which identify the strengths and weaknesses of the different data selection and voice training methods as rated by listeners for naturalness.