Submissions will be evaluated on naturalness and speaker-similarity scores, for both mono-lingual and cross-lingual synthesis.
In addition to the subjective evaluations, normalised latency will be reported for track 1, codec bitrate for track 2, and both normalised latency and bitrate for track 3.
Tracks 1 and 3 include 48 long utterances used specifically for objective evaluation, in addition to the 48 utterances used for subjective evaluation.
For subjective evaluation, each submission will be rated on 48 utterances in total, split equally between mono-lingual and cross-lingual synthesis.
Each generated audio will be rated by 3 evaluators who are native speakers of the target language, yielding 144 evaluations per team.
If there are more than 10 submissions, the top 10 teams will first be shortlisted based on ASR scores, and the subjective evaluation will be carried out only for these 10 teams. The ASR scores will be computed with ASR models trained on the respective languages, and the shortlisting will be based on the Character Error Rate (CER).
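As an illustration only, the following Python sketch shows how such a CER-based shortlist could be computed, assuming per-utterance reference transcripts and ASR hypotheses are available as strings. The use of the jiwer library, and all function names, are assumptions here, not the official scoring pipeline.

```python
# Illustrative sketch of CER-based shortlisting; not the official scoring
# pipeline. Assumes one reference transcript and one ASR hypothesis string
# per utterance, and uses the jiwer library to compute CER.
from jiwer import cer

def team_cer(references, hypotheses):
    """Corpus-level Character Error Rate for one team's submission."""
    return cer(references, hypotheses)

def shortlist(teams, k=10):
    """Keep the k teams with the lowest CER.

    `teams` maps a team name to a (references, hypotheses) pair of lists.
    """
    scored = {name: team_cer(refs, hyps) for name, (refs, hyps) in teams.items()}
    return sorted(scored, key=scored.get)[:k]
```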
The normalised latency will be calculated as follows. Let $F$ be the total number of audio frames generated, $T_F$ the time required to generate the first audio frame, and $T_R$ the time required to generate the remaining $F-1$ frames. The time required per frame, $T$, is given by

$$T = T_F + \frac{T_R}{F - 1}$$
We use this to calculate the normalised latency. Let $N$ be the total number of test-set audios, $T_P(i)$ the time required per frame in seconds by the proposed method to generate the $i$-th test audio, and $T_B(i)$ the time required per frame in seconds by the normalising model to generate the $i$-th test audio. The normalised latency $L$ is given by

$$L = \sum_{i=1}^{N} \frac{T_P(i)}{T_B(i)}$$
We will provide the normalising model and a script to obtain $T_B(i)$.
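For illustration, here is a minimal Python sketch of the two computations above, assuming the frame counts and wall-clock timings have already been measured. All function and variable names are illustrative; the organisers' script remains the reference implementation.

```python
# Illustrative sketch of the latency metrics defined above; the organisers'
# script remains the reference implementation.

def time_per_frame(t_first, t_rest, num_frames):
    """T = T_F + T_R / (F - 1): time to the first frame plus the average
    per-frame time over the remaining F - 1 frames."""
    return t_first + t_rest / (num_frames - 1)

def normalised_latency(t_proposed, t_baseline):
    """L = sum_i T_P(i) / T_B(i): per-utterance ratio of the proposed
    system's time per frame to the normalising model's, summed over
    the test set."""
    return sum(tp / tb for tp, tb in zip(t_proposed, t_baseline))
```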
The bitrate will be calculated as follows. Let the discrete units be $D_1, D_2, \ldots, D_N$, where $D_i$ denotes the discrete tokens of the $i$-th stream and $N$ is the total number of streams. Let $V_i$ be the vocabulary size of the $i$-th stream, $L_i$ the length of the $i$-th stream, and $T$ the total duration of the test-set audios in seconds. We define the bitrate $B$ (in bits per second) as

$$B = \frac{\sum_{i=1}^{N} L_i \log_2(V_i)}{T}$$
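For illustration, a minimal Python sketch of this bitrate computation, assuming per-stream token counts and vocabulary sizes are known; all names here are hypothetical.

```python
import math

# Illustrative bitrate computation for multi-stream discrete tokens;
# all variable names are hypothetical.

def bitrate(stream_lengths, vocab_sizes, total_seconds):
    """B = (sum_i L_i * log2(V_i)) / T, in bits per second.

    Each token of stream i can encode log2(V_i) bits, stream i emits
    L_i tokens, and T is the total test-set duration in seconds.
    """
    total_bits = sum(L * math.log2(V) for L, V in zip(stream_lengths, vocab_sizes))
    return total_bits / total_seconds
```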
For track 1 we pick the top 5 teams using normalised latency, and for track 2 the top 5 teams using bitrate. For track 3 we pick the top 5 teams under each metric, giving up to 10 teams. Next, we run the subjective evaluation for these teams and pick the top 5 for each track. To determine the top 5 teams for the overall challenge, we use the teams' ranks across all the tracks they participated in, as sketched below.
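Since the exact rank-combination rule is not specified above, the following Python sketch illustrates one plausible choice, mean rank across entered tracks; both the rule and all names are assumptions.

```python
# Illustrative rank aggregation; the exact combination rule is not stated,
# so mean rank across entered tracks is an assumption made here.

def overall_top5(track_ranks):
    """`track_ranks` maps a team to its list of per-track ranks
    (only for the tracks it entered). Returns the five teams with
    the best (lowest) mean rank."""
    mean_rank = {team: sum(ranks) / len(ranks) for team, ranks in track_ranks.items()}
    return sorted(mean_rank, key=mean_rank.get)[:5]
```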
The split of sentences across languages for the individual tracks is shown below.