Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion

Demo page

Audio samples of linguistic-speech regularization training

This data may be used only for non-commercial research purposes.

Paper

"Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion," in Proc. 12th ISCA SSW, Grenoble, France, Aug. 2023, pp. xxx-xxx. (accepted)

Authors

Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, and Hiroshi Saruwatari

Abstract

We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis methods with filled pause (FP) insertion. Spontaneous speech synthesis aims to produce speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-inserted synthetic speech is often limited. To address this issue, we present a method for synthesizing spontaneous speech that improves robustness to diverse FP insertions. Regularization is used to stabilize the synthesis of the linguistic speech (i.e., non-FP) elements. To further improve robustness to diverse FP insertions, the method uses pseudo-FPs sampled from an FP word prediction model in addition to ground-truth FPs. Our experiments demonstrated that the proposed method improves the naturalness of synthetic speech with ground-truth and predicted FPs by 0.24 and 0.26, respectively.
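
The pseudo-FP idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that an FP word prediction model has already produced, for the slot before each word, a probability distribution over an FP vocabulary plus a "no FP" label. The function name insert_pseudo_fps, the NO_FP label, and the example FP words are all hypothetical.

```python
import random

# Hypothetical label meaning "insert nothing at this slot".
NO_FP = "<none>"


def insert_pseudo_fps(words, slot_probs, rng):
    """Sample one label per slot and insert any sampled FP before its word.

    words:      list of non-FP (linguistic) words.
    slot_probs: one dict per word, mapping each label (an FP word or NO_FP)
                to its predicted probability for the slot before that word.
                In this sketch the distributions are given as input; in
                practice they would come from an FP word prediction model.
    rng:        a random.Random instance, passed in for reproducibility.
    """
    if len(words) != len(slot_probs):
        raise ValueError("need exactly one probability dict per word")

    output = []
    for word, probs in zip(words, slot_probs):
        labels = list(probs)
        weights = [probs[label] for label in labels]
        # Draw a single label according to the predicted distribution.
        sampled = rng.choices(labels, weights=weights, k=1)[0]
        if sampled != NO_FP:
            output.append(sampled)
        output.append(word)
    return output


if __name__ == "__main__":
    words = ["today", "we", "talk"]
    slot_probs = [
        {NO_FP: 0.7, "um": 0.2, "uh": 0.1},
        {NO_FP: 0.5, "um": 0.3, "uh": 0.2},
        {NO_FP: 0.9, "um": 0.05, "uh": 0.05},
    ]
    print(insert_pseudo_fps(words, slot_probs, random.Random(0)))
```

Sampling from the distribution, rather than always taking the most probable label, yields a different FP pattern for the same sentence on each draw, which is one plausible way to expose a synthesis model to diverse FP insertions during training. The linguistic words themselves are never altered or reordered.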