Phonetically conditioned prosody transplantation for TTS: 2-stage phone-level unit-selection framework

Author(s): Mythri Thippareddy, M. G. Khanum Noor Fathima, D. N. Krishna, A. Sricharan and V. Ramasubramanian


We propose a framework for prosody transplantation in TTS, namely 2-stage phone-level unit-selection, which transfers the prosody from a 'target' prosody database onto the unit sequence output by a conventional TTS system. The framework employs 'phonetic conditioning': target prosody profiles are identified conditioned on their underlying phonetic content, over variable-length time scales that tend to be as long as possible. In this 2-stage framework, the units determined by a 1st-stage conventional unit-selection are mapped to units in a 2nd-stage prosodic-style database via a phone-level unit-selection. This second stage retrieves units together with their associated prosody (representing the prosodic style of the 2nd-stage database), and the selected prosody is then imposed on the 1st-stage units. We recently proposed this framework, with early qualitative results indicating the viability of the approach. In this paper, we elaborate on the approach, characterize the performance of the framework with objective measures computed against prosodic ground truth and as a function of the system parameters, and show that the approach realizes the target prosody effectively.
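To make the idea of phonetic conditioning over "as long as possible" time scales concrete, the following Python sketch shows one way the 2nd-stage lookup could work. It is an illustration under stated assumptions, not the authors' implementation: it assumes a greedy longest-match over phone sequences, a prosody database given as a flat phone list with one (F0, duration) pair per phone, and exact phone-identity matching with no target or join costs. The names `build_index`, `transplant`, and `max_len` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-phone prosody record: (mean F0 in Hz, duration in ms).
Prosody = Tuple[float, float]


def build_index(db_phones: List[str], max_len: int) -> Dict[Tuple[str, ...], int]:
    """Map every phone n-gram (n <= max_len) in the prosody database
    to the start position of its first occurrence."""
    index: Dict[Tuple[str, ...], int] = {}
    for start in range(len(db_phones)):
        for n in range(1, max_len + 1):
            if start + n > len(db_phones):
                break
            index.setdefault(tuple(db_phones[start:start + n]), start)
    return index


def transplant(
    stage1_phones: List[str],
    db_phones: List[str],
    db_prosody: List[Prosody],
    max_len: int = 5,
) -> List[Optional[Prosody]]:
    """For each 1st-stage phone, pick prosody from the 2nd-stage database.

    Assumption: greedy selection that always prefers the longest phone
    sequence found in the database. Phones with no match get None, meaning
    the 1st-stage prosody would be left unchanged.
    """
    index = build_index(db_phones, max_len)
    out: List[Optional[Prosody]] = []
    i = 0
    while i < len(stage1_phones):
        matched = False
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(stage1_phones) - i), 0, -1):
            start = index.get(tuple(stage1_phones[i:i + n]))
            if start is not None:
                out.extend(db_prosody[start:start + n])
                i += n
                matched = True
                break
        if not matched:
            out.append(None)  # no phonetic match: keep original prosody
            i += 1
    return out


# Toy data, invented purely for illustration.
db_phones = ["k", "ae", "t", "s", "ae", "t"]
db_prosody = [(120.0, 80.0), (130.0, 110.0), (110.0, 70.0),
              (115.0, 90.0), (125.0, 100.0), (105.0, 60.0)]
stage1_phones = ["s", "ae", "t", "k", "ae", "zh"]

result = transplant(stage1_phones, db_phones, db_prosody)
print(result)
```

In this toy run, the first three phones are covered by a single 3-phone match ("s ae t"), the next two by a 2-phone match ("k ae"), and the last phone has no match. A realistic system would presumably score competing candidates rather than take the first occurrence, which this sketch does not attempt.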