Tone modeling using Gaussian process latent variable model for statistical speech synthesis

Author(s): Decha Moungsri, Tomoki Koriyama and Takao Kobayashi


In continuous Thai speech, tone pronunciation is affected by several factors. One significant factor is stress, which diversifies the F0 contours of tones and also affects syllable durations. Our previous studies have shown that a stressed/unstressed syllable context improves tone modeling accuracy. However, stress in Thai is generally unknown for a given input text, and it varies widely in degree. A simple binary stressed/unstressed context is therefore not enough to represent the intensity of stress. In this study, we introduce an unsupervised dimensionality reduction technique, the variational Gaussian process latent variable model (GP-LVM), to represent this diversity of stress. The stress-related information, namely F0 contour and duration, is projected onto a latent space of lower dimensionality than the original feature space, so that positions in the latent space represent the degree of stress. We then use the data points in the latent space as a context in a Gaussian process regression (GPR)-based speech synthesis framework, which allows the similarity of contextual factors to be determined continuously by a kernel function. We examine two approaches to data projection: single-space projection and separated-space projection. Objective and subjective evaluation results show that the proposed technique improves tone modeling.
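The difference between a binary stress context and a continuous latent-space context can be illustrated with a minimal sketch. It assumes a squared-exponential (RBF) kernel, which the abstract does not specify, and uses made-up 2-D latent coordinates in place of points actually learned by the variational GP-LVM. The names `rbf_kernel`, `binary_similarity`, and `latent` are hypothetical and serve only this illustration.

```python
import math


def rbf_kernel(x, y, lengthscale=1.0):
    """Squared-exponential kernel between two latent points (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def binary_similarity(a, b):
    """Similarity under a stressed/unstressed label: identical labels match, others do not."""
    return 1.0 if a == b else 0.0


# Hypothetical 2-D latent coordinates for three syllables.
# These values are invented for illustration, not learned from data.
latent = {
    "strong_stress": (1.0, 0.8),
    "mild_stress": (0.6, 0.5),
    "unstressed": (-0.9, -0.7),
}

# Continuous similarity: nearby latent points give a kernel value close to 1,
# distant points give a value close to 0.
k_strong_mild = rbf_kernel(latent["strong_stress"], latent["mild_stress"])
k_strong_unstressed = rbf_kernel(latent["strong_stress"], latent["unstressed"])

# A binary label treats strongly and mildly stressed syllables as identical.
b_strong_mild = binary_similarity("stressed", "stressed")

print(f"kernel(strong, mild)       = {k_strong_mild:.3f}")
print(f"kernel(strong, unstressed) = {k_strong_unstressed:.3f}")
print(f"binary(strong, mild)       = {b_strong_mild:.1f}")
```

Under these assumptions the kernel grades similarity by distance in the latent space, whereas the binary label cannot distinguish degrees of stress within the stressed class.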