Challenge Rules

NOTE:

  • Participants may have to submit their code if requested by the challenge organizers. However, the intellectual property (IP) is not transferred to the challenge organizers, i.e., if the code is shared/submitted, the participants remain the owners of their code (if the code is made publicly available, an appropriate license should be added).

  • Based on the objective evaluations, only the best 10 submissions from each track will be shortlisted for the final subjective evaluation.

  • The challenge organizers will invite the top 5 ranked teams to submit a 2-page paper to ICASSP-2023.


  1. Participants are expected to sign an agreement when they download the data.

  2. Participants will need to share their final TTS system through an API.

  3. In case of any clarification or discrepancy, participants should be ready to submit their code as well (the IP clause above applies in this case).

  4. After the challenge, we request every participant to open-source their code or submit it to us to open-source (the IP clause above applies in this case).

  5. Participants must not use any external dataset for training, nor any pretrained TTS models trained on external datasets.

  6. Participants may use pretrained speaker embeddings such as x-vectors, i-vectors, etc. (or train them on the given corpus), but these embeddings must be one-dimensional, i.e., sequence-level embeddings are not to be used (see the pooling sketch after this list).

  7. The models should be trained at a 22050 Hz sampling rate (see the resampling sketch after this list).

  8. Participants may perform preprocessing on the text, including but not limited to Unicode normalisation, phonemisation, etc. Take care to cover the full range of characters that can be present in the eval set (link); a coverage-check sketch is given after this list.

  9. All decisions taken by the organizers are final. If there is a lack of clarity on any point, we encourage participants to reach out to us [challenge.syspin@iisc.ac.in].
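
The following is a minimal sketch for rule 6, showing the difference between a sequence-level feature matrix and the one-dimensional, utterance-level embedding that is allowed. The mean-pooling step, shapes, and names are illustrative assumptions; any extractor (x-vector, i-vector, or one trained on the given corpus) may produce the frame-level features.

    import numpy as np

    def utterance_embedding(frame_features: np.ndarray) -> np.ndarray:
        """Pool frame-level features (T x D) into one 1-D embedding of shape (D,).

        Rule 6 allows only this pooled, one-dimensional vector; the full T x D
        sequence-level matrix must not be fed to the TTS model.
        """
        assert frame_features.ndim == 2, "expected a (frames, dims) matrix"
        return frame_features.mean(axis=0)

    # Hypothetical frame-level output of a speaker-embedding extractor.
    frames = np.random.randn(300, 512)        # 300 frames, 512-dim features
    print(utterance_embedding(frames).shape)  # (512,)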

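For rule 7, a minimal resampling sketch, assuming librosa and soundfile are available; the file paths are placeholders.

    import librosa
    import soundfile as sf

    TARGET_SR = 22050  # required training sampling rate

    # librosa resamples on load when sr is given; "in.wav"/"out.wav" are placeholders.
    audio, sr = librosa.load("in.wav", sr=TARGET_SR)
    sf.write("out.wav", audio, TARGET_SR)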

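For rule 8, a small sketch of Unicode normalisation and a character-coverage check against held-out text; the NFC form and the file names are assumptions, not requirements.

    import unicodedata

    def normalise(text: str) -> str:
        # NFC is one common normalisation form; use whatever your front end expects.
        return unicodedata.normalize("NFC", text)

    def charset(path: str) -> set:
        with open(path, encoding="utf-8") as f:
            return {ch for line in f for ch in normalise(line)}

    # Placeholder file names: every character in the eval text should be
    # handled by the text front end built from the training text.
    uncovered = charset("eval_text.txt") - charset("train_text.txt")
    print("uncovered characters:", sorted(uncovered))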

Track 1 Rules

  1. Participants may use at most 5 hours of audio data (and the corresponding text) from each speaker for training the model. Participants must share the file IDs used for training and should be able to replicate the results with the same files if the need arises (see the manifest sketch after this list).

  2. A dev set with a total duration of less than 1 hour may be used, and these files must not be used in the train set.

  3. Participants may use text-to-mel-spectrogram models (with an additional vocoder) or text-to-waveform models.

  4. Participants may use the vocoder provided by the organisers or their own vocoder (trained only on the challenge data; the full 40 hours of data per speaker may be used).
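
A minimal sketch for rule 1 above: accumulate files until the 5-hour budget is reached and write the selected file IDs to a manifest that can be shared with the organisers. The directory layout and the soundfile dependency are assumptions; the same approach can carve out the sub-1-hour dev set of rule 2 from the remaining files.

    import glob
    import os
    import soundfile as sf

    TRAIN_BUDGET_SEC = 5 * 3600  # at most 5 hours of training audio per speaker

    selected, total = [], 0.0
    for path in sorted(glob.glob("speaker1/wavs/*.wav")):  # placeholder layout
        dur = sf.info(path).duration                       # duration in seconds
        if total + dur > TRAIN_BUDGET_SEC:
            break
        selected.append(os.path.splitext(os.path.basename(path))[0])
        total += dur

    # Share this manifest so the training subset can be reproduced.
    with open("train_file_ids.txt", "w") as f:
        f.write("\n".join(selected))
    print(f"{len(selected)} files, {total / 3600:.2f} hours")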


Track 2 Rules

  1. Participants may only use a text-to-mel-spectrogram model, with at most 5 million parameters. Participants must share a layer-wise breakdown of the parameters of the final model (see the parameter-counting sketch after this list).

  2. For fair evaluation across teams, participants must use the vocoder provided by the organizers to synthesize speech. Speaker-specific WaveGlow vocoders will be provided, trained with the NVIDIA implementation using default parameters.

  3. Participants may fine-tune the vocoder(s) with all the shared data. The vocoder weights will be shared on Dec 15th, along with an inference script and the baseline model.

  4. Participants may train a larger model on the provided data and then distil the learning into a small model with at most 5 million parameters (a minimal distillation sketch is given after this list).

  5. Participants may rely on different forms of model compression, efficient architectures, etc., with the effective parameters used for a sequence counted towards the parameter limit (i.e., if you impose sparsity on the model via pruning or other techniques, pruned parameters will not be counted; the sketch after this list also reports a non-zero count). Sparse modules are also allowed, but if layers are selected explicitly based on language/speaker IDs, all parameters will be counted.

  6. If model-based speaker embeddings or phonemizers are used, those models' parameters will not be counted towards the 5 million limit of the TTS model.
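
A minimal parameter-counting sketch for rules 1 and 5, assuming a PyTorch model; the placeholder model is illustrative only. It prints a per-parameter (layer-wise) breakdown, the total count, and the effective (non-zero) count that applies when pruning is used.

    import torch.nn as nn

    PARAM_LIMIT = 5_000_000

    def report(model: nn.Module):
        total, effective = 0, 0
        for name, p in model.named_parameters():
            n = p.numel()
            nz = int((p != 0).sum())  # non-zero (i.e., not pruned) parameters
            total += n
            effective += nz
            print(f"{name:40s} total={n:>10d} effective={nz:>10d}")
        print(f"TOTAL={total}  EFFECTIVE={effective}  LIMIT={PARAM_LIMIT}")

    # Placeholder model; replace with your text-to-mel-spectrogram model.
    report(nn.Sequential(nn.Embedding(100, 256), nn.Linear(256, 80)))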

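For rule 4, a minimal sketch of one common distillation recipe on mel-spectrogram outputs, assuming teacher and student predictions of the same shape; the L1 losses, the alpha weight, and the shapes are assumptions, not a prescribed method.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_mel, teacher_mel, target_mel, alpha=0.5):
        """Blend a ground-truth loss with a teacher-matching loss."""
        gt_loss = F.l1_loss(student_mel, target_mel)            # match real mels
        kd_loss = F.l1_loss(student_mel, teacher_mel.detach())  # match teacher mels
        return alpha * kd_loss + (1.0 - alpha) * gt_loss

    # Hypothetical shapes: (batch, mel_bins, frames).
    student = torch.randn(2, 80, 200, requires_grad=True)
    teacher = torch.randn(2, 80, 200)
    target = torch.randn(2, 80, 200)
    distillation_loss(student, teacher, target).backward()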

Track 3 Rules

All rules from Tracks 1 and 2 are applicable in Track 3.