Detecting emphasized spoken words by considering them prosodic outliers and taking advantage of HMM-based TTS Framework

Author(s): Hui Liang


This work investigates a fresh approach to detecting emphasised spoken words that adopts the concept of one-class classification, thereby avoiding a major difficulty: collecting a large amount of well-annotated training data containing emphasis. The key idea, in brief, is that rich context-dependent phone models are first trained on common, neutrally read speech data in the HMM-based speech synthesis framework; emphasised words are then treated as prosodic outliers with respect to these "neutral" phone models and are detected as such. Experiments were conducted on German speech data without any simplifying assumption (e.g. that each utterance contains only one emphasised word). Under many conditions this universally applicable approach was found to outperform random guessing, even though emphasised words constituted only a small portion (6.28%) of the test set.
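
The one-class idea can be illustrated with a deliberately simplified sketch. It is not the system described here: instead of context-dependent HMM phone models it assumes a single diagonal Gaussian over two hypothetical per-word prosodic features (mean F0 in Hz and duration in seconds), fitted to synthetic "neutral" data only. All function names, feature choices, and the 95th-percentile threshold are assumptions made for illustration.

```python
import math
import random
from statistics import mean, stdev


def fit_neutral_model(neutral_features):
    """Estimate a per-dimension (mean, std) pair from neutral data only."""
    dims = list(zip(*neutral_features))
    return [(mean(d), stdev(d)) for d in dims]


def outlier_score(model, features):
    """Average negative log-likelihood of one word under the diagonal Gaussian.

    A higher score means the word is less well explained by the neutral model.
    """
    nll = 0.0
    for (mu, sigma), x in zip(model, features):
        nll += 0.5 * math.log(2 * math.pi * sigma ** 2)
        nll += (x - mu) ** 2 / (2 * sigma ** 2)
    return nll / len(features)


def detect_emphasis(model, words, threshold):
    """Flag every word whose outlier score exceeds the threshold."""
    return [outlier_score(model, f) > threshold for f in words]


# Synthetic neutral training data (assumed values, not from the paper):
# mean F0 around 120 Hz, word duration around 0.30 s.
rng = random.Random(0)
neutral = [(rng.gauss(120.0, 10.0), rng.gauss(0.30, 0.05)) for _ in range(500)]
model = fit_neutral_model(neutral)

# Threshold chosen from neutral data alone: the 95th percentile of the
# training scores. No emphasised examples are needed at any point.
train_scores = sorted(outlier_score(model, f) for f in neutral)
threshold = train_scores[int(0.95 * len(train_scores))]

# Three hypothetical test words; the second has raised F0 and longer duration.
test_words = [(121.0, 0.31), (180.0, 0.55), (118.0, 0.28)]
flags = detect_emphasis(model, test_words, threshold)
print(flags)
```

Because both the model and the threshold are derived from neutral speech alone, no emphasis-annotated training data is required, which is the property that motivates the one-class formulation. Nothing in this sketch restricts the number of flagged words per utterance.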