2026.4.16: Challenge website and registration are now public! We look forward to your participation!
Founded in 2022, the VoiceMOS Challenge (VMC) series aims to use standardized datasets in diverse and challenging domains to understand and compare prediction techniques for human ratings of speech, specifically those collected through a mean opinion score (MOS) test, hence the name VoiceMOS Challenge. The main motivation is to foster the development of automatic, data-driven speech assessment approaches as an alternative to the costly and time-consuming human listening tests that are conventionally regarded as the gold standard for evaluating speech.
After running VMC for three years, in 2025 we organized the AudioMOS Challenge (AMC), enlarging the scope to singing voices, music, and even general synthetic audio. Despite its success, we received feedback from the community that key problems in the evaluation of speech remain unsolved. In 2026, we therefore decided to return to VMC and put our focus back on speech.
For this year, the primary evaluation metric for MOS prediction will be the utterance-level Spearman’s rank correlation coefficient (UTT-SRCC). As usual, there is no participation fee, and the challenge will be held on CodaBench (https://www.codabench.org/). We are still deciding on the venue for participants to submit their papers; the current plan is to host a special session or a satellite workshop at ICASSP 2027.
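For a quick reference, the sketch below shows how UTT-SRCC can be computed with SciPy's spearmanr; the per-utterance MOS arrays are made-up placeholders for illustration, not challenge data.

```python
# Minimal illustration of the UTT-SRCC metric using SciPy.
# The score arrays below are placeholders, not challenge data.
import numpy as np
from scipy.stats import spearmanr

# Ground-truth MOS and system-predicted MOS, one value per utterance.
true_mos = np.array([3.2, 4.1, 2.5, 3.8, 4.6])
pred_mos = np.array([3.0, 4.3, 2.8, 3.5, 4.4])

utt_srcc, _ = spearmanr(true_mos, pred_mos)
print(f"UTT-SRCC: {utt_srcc:.3f}")
```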
The first track is on speech generated by speech enhancement systems. This track is based on the subjective listening test data from the ICASSP 2026 URGENT Challenge. The dataset contains enhanced speech samples from the six top-performing systems in the challenge, evaluated on 840 multilingual utterances across nine languages. Given an input speech sample, the system is required to predict both the Absolute Category Rating (ACR) and the Comparative Category Rating (CCR).
The second track focuses on synthetic speech from emotional TTS systems and emotional human speech. Given an input speech sample, the system is required to predict (1) MOS for speech quality, (2) MOS for emotion (degree of similarity to the target synthesized emotion label), and, optionally, (3) listeners' categorical choices of perceived emotion and their ratings of valence, arousal, and dominance.
The third track targets accented English speech generated by codec-based speech synthesis systems. This track is based on the CodecMOS-Accent dataset, which contains 4,000 samples from 24 contemporary codec resynthesis and TTS systems, featuring 32 speakers across ten distinct accents. Given an input speech sample and a reference speech sample, the system is required to predict both the speaker and accent similarity scores.
NOTE: Registration is open until the end of the challenge (August 7)!
Please fill in the registration form: https://forms.gle/L6YdkUf1PJdSSwLU7
Once we confirm your registration, we will contact you with the link to the CodaBench page and instructions on how to download the datasets. (Note that this will not happen until the release date.)
The tentative schedule for the VoiceMOS Challenge 2026 is as follows:
Friday, May 22 (or earlier): Training datasets released on the CodaBench page.
Friday, July 31: Evaluation dataset released to participants.
Friday, August 7: Predicted scores submission deadline.
Monday, August 31: Results announced.
TBA: ICASSP 2027 paper deadline.
Registration must be done using an institutional email address (e.g., a university or company address), not a personal one, unless you are participating as an individual researcher.
Participants are required to submit a system description after the challenge ends.
Any public dataset may be used to develop your prediction system, and all datasets used must be reported in the system description. The use of proprietary datasets, including MOS ratings you collect yourself, is not permitted unless those resources are made publicly available.
Track 1: URGENT-MOS
Codebase: TBA
Track 2:
UTMOS
Codebase: https://github.com/sarulab-speech/UTMOS22
An LLM-based method
Codebase: TBA
Emotion2Vec
Codebase: https://github.com/ddlBoJack/emotion2vec
Track 3: a speaker embedding-based method (a minimal sketch of this idea appears after this list)
Codebase: TBA
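As a rough illustration of the speaker embedding-based idea for Track 3, the sketch below scores speaker similarity as the cosine similarity between embeddings of the test and reference samples, rescaled to a 1-5 range. The embedding extractor, the rescaling, and all names are placeholder assumptions for illustration only, not the official baseline; an accent-similarity predictor could follow the same pattern with an accent encoder.

```python
# Minimal sketch of a speaker embedding-based similarity predictor.
# "extract_embedding" is a hypothetical placeholder for any pretrained
# speaker encoder (e.g., an ECAPA-TDNN model); the 1-5 rescaling is an
# assumed mapping, not the official scoring protocol.
import numpy as np


def extract_embedding(waveform: np.ndarray) -> np.ndarray:
    """Placeholder: replace with a real pretrained speaker encoder."""
    # A fixed random projection so the sketch runs end to end.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((waveform.shape[0], 192))
    return waveform @ proj


def similarity_score(test_wav: np.ndarray, ref_wav: np.ndarray) -> float:
    """Map cosine similarity of speaker embeddings onto a 1-5 scale."""
    e1, e2 = extract_embedding(test_wav), extract_embedding(ref_wav)
    cos = float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
    return 1.0 + 4.0 * (cos + 1.0) / 2.0  # [-1, 1] -> [1, 5]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    test, ref = rng.standard_normal(16000), rng.standard_normal(16000)
    print(f"Predicted speaker similarity: {similarity_score(test, ref):.2f}")
```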
Wen-Chin Huang & Tomoki Toda (Nagoya University, Japan)
Erica Cooper (National Institute of Information and Communications Technology, Japan)
Wei Wang (Shanghai Jiao Tong University, China)
Marvin Sach (Technische Universität Braunschweig, Germany)
Xiaoxue Gao (A*STAR, Singapore)