For all tracks, the TTS systems and the neural codec models can be pretrained open-source models or can be trained on the challenge data plus open-source data.
All synthesised audio must have a sampling rate of 16 kHz.
The attribute-specific neural codec requirement applies only to track 2. The details of these attributes are to be shared with the organisers.
Any model architecture is allowed.
Pretrained speaker encoders can be used to represent speakers.
Any auxiliary loss objectives are allowed, such as an ASR-based penalty, speaker similarity, or MOS prediction.
For tracks 1 and 3, the evaluation will involve synthesis across different speakers in the challenge data.
For track 2, the evaluation will involve synthesis using zero-shot voice cloning.
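
The 16 kHz requirement stated above can be checked before submission. The following is a minimal sketch, assuming the synthesised audio is stored as PCM WAV files (the rules do not name a file format, so this is an assumption). The helper names `wav_sample_rate`, `is_compliant` and `make_silent_wav` are hypothetical and only serve the illustration; only the Python standard library is used.

```python
import io
import wave

# Sampling rate required by the rules above.
REQUIRED_RATE_HZ = 16_000


def wav_sample_rate(source) -> int:
    """Return the sampling rate of a PCM WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


def is_compliant(source) -> bool:
    """True if the WAV file uses the required 16 kHz sampling rate."""
    return wav_sample_rate(source) == REQUIRED_RATE_HZ


def make_silent_wav(rate_hz: int, seconds: float = 0.1) -> io.BytesIO:
    """Build a short mono 16-bit silent WAV in memory (for demonstration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds))
    buf.seek(0)  # rewind so the buffer can be read back
    return buf


if __name__ == "__main__":
    for rate in (16_000, 22_050):
        status = "ok" if is_compliant(make_silent_wav(rate)) else "needs resampling"
        print(f"{rate} Hz: {status}")
```

In practice `is_compliant` would be called with the path of each synthesised file. Files that fail the check would need to be resampled with an audio tool of the participant's choice.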