A study on BLSTM-RNN-based Chinese prosodic structure prediction in a unified framework with character-level features

Author(s): Yi Zhao, Chuang Ding, Nobuaki Minematsu and Daisuke Saito


In a Text-to-Speech system, prosodic attributes have to be predicted from the input text alone. The accuracy of prosody prediction has a significant effect on the naturalness of synthesized Chinese speech. In this paper, we explore using neural networks to predict prosodic boundaries from Chinese text without task-specific knowledge or sophisticated feature engineering. We examine character-level and word-level input features, and compare their performance under one-hot and embedding representations. Instead of the traditional cascaded prediction, we propose a unified framework, which can be regarded as a multi-task learning process. Experimental results show that character-level features achieve F-scores comparable to those obtained with word-level features, and that embedding features learned from large unlabeled texts can help to improve performance. The unified framework achieves performance similar to that of the cascaded framework, while requiring less training time and removing the need to prepare task-specific features.