STOMA: Towards real-time, enhanced text-to-speech synthesis on the device (PI, 2022-2024)
ELIDEK
Text-to-Speech (TTS) synthesis eases human-machine communication by converting arbitrary text into informative, natural-sounding speech signals. Recent TTS systems based on neural network models constitute a major breakthrough in the quality, naturalness and expressivity of synthetic speech. Unfortunately, neural TTS systems have millions of trainable parameters, demanding not only expensive computational resources but also large collections of high-quality recordings. Consequently, developing and deploying production-quality speech synthesis systems is laborious and costly, and the generation of synthetic speech (i.e., the inference process) is slow.
STOMA aims to devise and implement novel, lighter TTS synthesis systems whose parameter space is significantly smaller than that of existing systems, without sacrificing speech quality or naturalness. Lighter TTS systems enable real-time speech production on the device because they require fewer computational operations inside the neural network. Additionally, STOMA takes into consideration the vibrating, quasi-periodic nature of speech as well as the spectro-temporal analysis performed by the human auditory system, and it will be robustly adapted to intelligibility-enhanced synthesis systems through transfer learning.