SilhoueTTS: Reference Guided Text to Speech using Prosody Features
Authors
DongSik Yoon (Korea University) kevinds1106@korea.ac.kr
Chungho Park (DeepMachineLab) mattpark@dmlab.ai
Jeongki Min (Korea University) vhfnxnrkf@korea.ac.kr
Hanseok Ko (Korea University) hsko@korea.ac.kr
Abstract
Existing deep-learning-based text-to-speech (TTS) approaches can synthesize human-like voices rapidly and naturally. In particular, progress in non-autoregressive models has greatly contributed to the development of many TTS variants. Nevertheless, incorporating natural prosodic variation, speaking styles, and emotional tones into voice generation is still challenging. To address this problem, we propose a novel framework for reference-voice-style-guided TTS using prosody features. To extract the prosody features of the reference voice, we propose a style encoder. Further, we adopt an AdaIN layer in our variance adaptor to reflect the extracted features in the overall results. We also propose a prosody loss function that compares reference features with target embedding vectors, inducing a better mapping into disentangled spaces. Experimental results demonstrate that the proposed method reflects the style of the reference voice in detail and synthesizes clearer voices than existing methods.
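The abstract describes the AdaIN conditioning and the prosody loss only at a high level. Below is a minimal PyTorch sketch of how an AdaIN layer inside a variance adaptor and a prosody loss of this kind could look; the names `AdaIN` and `prosody_loss`, the tensor layout, and the L1 form of the loss are illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaIN(nn.Module):
    """Adaptive Instance Normalization conditioned on a style embedding.

    Normalizes each channel of the content features, then re-scales and
    re-shifts them with affine parameters predicted from the style vector.
    """
    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        # Predict per-channel scale (gamma) and shift (beta) from the style embedding.
        self.affine = nn.Linear(style_dim, channels * 2)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        # Broadcast the per-channel affine parameters over the time axis.
        return self.norm(x) * (1 + gamma.unsqueeze(-1)) + beta.unsqueeze(-1)


def prosody_loss(ref_style: torch.Tensor, target_emb: torch.Tensor) -> torch.Tensor:
    # One plausible form: L1 distance between the reference prosody features
    # and the target embedding, pulling both toward a shared prosody space.
    return F.l1_loss(ref_style, target_emb)


# Usage sketch: condition hidden features from the variance adaptor
# on a style vector produced by the style encoder.
adain = AdaIN(channels=256, style_dim=128)
hidden = torch.randn(4, 256, 100)   # (batch, channels, frames)
style = torch.randn(4, 128)         # style-encoder output
out = adain(hidden, style)          # same shape as `hidden`
```

Under these assumptions, the style vector modulates every frame of the content features through the predicted scale and shift, which is how a single reference embedding can shape the prosody of the whole utterance.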
Audio Samples
LJSpeech: the commission has found that oswald's movements, as described by these witnesses
(Ground Truth)
Reference audio: Emov-DB
Reference audio:
GST:
Styler:
Ours:
LJSpeech: she thought he was attending a class or was on his own business
(Ground Truth)
Reference audio:
GST:
Styler:
Ours:
Reference audio: Korean Emotion Conversation Corpus
LJSpeech: whose major wound fell within doctor shaw's area of specialization
(Ground Truth)
Reference audio:
GST:
Styler:
Ours: