Using local phrase dependency structure information
in neural sequence-to-sequence speech synthesis
Nobuyoshi Kaiki†, Sakriani Sakti†‡ and Satoshi Nakamura†‡
†Nara Institute of Science and Technology, Japan, ‡RIKEN AIP, Japan
Abstract
We introduce end-to-end text-to-speech (TTS) synthesis with prosodic symbols that represent phrase components based on local syntactic dependency structures, with the aim of synthesizing Japanese speech with natural prosody. We propose two TTS models: 1) one with prosodic symbols that represent the syntactic dependency distance at phrase boundaries and 2) another with prosodic symbols that reflect a superimposed model of the phrase and accent components based on an F0 generation control mechanism. With these two models, we observed 1) pause insertion that marks phrase boundaries and 2) F0 resetting at right-branching boundaries. To verify the effectiveness of the two proposed models against a conventional model that uses only accent components, we conducted an AB test as a subjective evaluation. Our results confirmed that synthetic speech with natural prosody, reflecting the intended meaning of the utterance, can be generated in Japanese end-to-end TTS by using the local phrase dependency information of sentences and the F0 generation model.
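As an illustration of the first model's input representation, the following minimal sketch inserts a symbol encoding the dependency distance after each phrase. It is not the implementation used in this work: the function name `annotate_with_dependency_distance`, the `#<distance>` symbol format, the use of `-1` to mark the root phrase, and the romanized example phrases are all assumptions made for illustration. The sketch also assumes that every phrase depends on a later phrase, as is typical of head-final Japanese dependency structures.

```python
from typing import List


def annotate_with_dependency_distance(phrases: List[str], heads: List[int]) -> str:
    """Insert a hypothetical boundary symbol '#<d>' after each non-root phrase.

    phrases: phrases (bunsetsu) of one sentence, in order.
    heads:   heads[i] is the index of the phrase that phrase i depends on,
             or -1 for the root phrase (assumed convention).
    d is the dependency distance heads[i] - i, measured in phrases.
    """
    if len(phrases) != len(heads):
        raise ValueError("phrases and heads must have the same length")

    tokens: List[str] = []
    for i, (phrase, head) in enumerate(zip(phrases, heads)):
        tokens.append(phrase)
        if head < 0:
            # Root phrase: no outgoing dependency, so no boundary symbol.
            continue
        distance = head - i
        if distance <= 0 or head >= len(phrases):
            raise ValueError(f"phrase {i} must depend on a later phrase")
        tokens.append(f"#{distance}")
    return " ".join(tokens)


# Hypothetical example: "akai" and "ookina" both modify "hon-o",
# which in turn depends on the sentence-final predicate "yonda".
example = annotate_with_dependency_distance(
    ["akai", "ookina", "hon-o", "yonda"],
    [2, 2, 3, -1],
)
print(example)
```

In this sketch, a distance of 1 means that a phrase modifies its immediate neighbor, and a larger distance means that it depends on a phrase further to the right. How such distances are actually mapped to prosodic symbols in the proposed models is not specified here.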