Challenge Tracks

TRACK 1: Data selection

Recent literature has shown that a limited amount of data can be sufficient to build high-quality speech synthesis models in a multi-speaker, multi-lingual setup. In this track, we share 40 hours of data from each of the six speakers, from which participants may use at most 5 hours per speaker to train one multi-speaker, multi-lingual model. The goal is to identify subsets of the larger corpus that are best suited for multi-speaker, multi-lingual TTS training. The presence of parallel sentences across speakers in a language can also be exploited in this formulation.
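
As a minimal sketch of the selection step, the snippet below enforces the 5-hour-per-speaker budget. It assumes per-speaker metadata as (utterance_id, duration_sec) pairs and uses a random draw purely as a placeholder criterion; actual submissions would rank utterances by whatever selection strategy the participants devise.

    import random

    MAX_SECONDS_PER_SPEAKER = 5 * 3600  # 5-hour cap per speaker

    def select_subset(utterances, seed=0):
        """Pick utterances for one speaker until the 5-hour budget is filled.

        utterances: list of (utterance_id, duration_sec) tuples.
        Returns the list of selected utterance ids.
        """
        rng = random.Random(seed)
        pool = list(utterances)
        rng.shuffle(pool)  # placeholder for a real scoring/ranking strategy
        chosen, total = [], 0.0
        for utt_id, dur in pool:
            if total + dur > MAX_SECONDS_PER_SPEAKER:
                continue  # skip items that would overshoot the budget
            chosen.append(utt_id)
            total += dur
        return chosen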


TRACK 2: Lightweight TTS

The size of a TTS model is an important consideration when deploying such models in practice. For many applications, it is not economical to host the large-scale models built for research. It would also be ideal to incorporate a multitude of speakers and languages in a single model to further lower hosting costs. Towards this, we propose this track to build lightweight multi-speaker, multi-lingual TTS models, which may employ techniques such as model distillation, compression, lighter model architectures, etc. In this problem statement, we limit the TTS model (text to Mel spectrogram) to 5M usable parameters, while a fixed vocoder provided by the organizers is to be used. This limit is set based on advances made in recent TTS work.
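
A simple way to verify the 5M-parameter budget is to count the trainable parameters of the text-to-Mel model only, excluding the fixed vocoder. The sketch below assumes a PyTorch model; acoustic_model is a hypothetical placeholder for a participant's network, not part of the challenge kit.

    import torch.nn as nn

    PARAM_BUDGET = 5_000_000  # Track 2 limit on the text-to-Mel model

    def count_trainable_params(model: nn.Module) -> int:
        """Return the number of trainable parameters in the model."""
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    # Placeholder acoustic model for illustration only.
    acoustic_model = nn.Sequential(nn.Embedding(256, 128), nn.Linear(128, 80))
    n_params = count_trainable_params(acoustic_model)
    assert n_params <= PARAM_BUDGET, f"model too large: {n_params} parameters"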


TRACK 3: Lightweight model development from best data

This track is a combination of Track 1 and Track 2. Participants are required to build one multi-speaker, multi-lingual, lightweight speech synthesis model using at most 5 hours of data from each speaker, with the overall model (text to Mel spectrogram) kept under 5M parameters. Participants must use the vocoder provided by the organizers.