Latest Updates (as of 30 May 2023):

🎉 The MERLIon CCS Challenge has been accepted as a special session at Interspeech 2023! 🎉 We are looking forward to seeing everyone!

The following papers related to the challenge have been accepted at Interspeech 2023:

Y. H. V. Chua, H. Liu, L. P. Garcia Perera, F. T. Woon, J. Wong, X. Zhang, S. Khudanpur, A. W. H. Khong, J. Dauwels, and S. J. Styles, "MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization". http://arxiv.org/abs/2305.18881
S. J. Styles, Y. H. V. Chua, F. T. Woon, H. Liu, L. P. Garcia Perera, S. Khudanpur, A. W. H. Khong, and J. Dauwels, "Investigating model performance in language identification: beyond simple error statistics". http://arxiv.org/abs/2305.18925

The data archive for the MERLIon CCS dataset is under preparation:

Y. H. V. Chua, L. P. Garcia Perera, S. Khudanpur, A. W. H. Khong, J. Dauwels, F. T. Woon, and S. J. Styles, “Development and Evaluation data for Multilingual Everyday Recordings - Language Identification on Code-Switched Child-Directed Speech (MERLIon CCS) Challenge,” 2023, DR-NTU (Data), DRAFT VERSION, https://doi.org/10.21979/N9/ANXS8Z

About

The inaugural MERLIon CCS Challenge focuses on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous code-switched, child-directed speech collected via Zoom.

Due to a bias towards standard speech varieties, non-standard, accented speech remains an ongoing challenge for automatic speech processing. Although existing work has explored automatic speech recognition and language diarization in code-switching speech corpora, these tasks remain challenging for natural, in-the-wild speech containing more than one language, particularly when code-switching occurs in short language spans. Moreover, because child-directed speech contains acoustic features that are difficult for automatic language identification and language diarization, speech processing systems often struggle with natural speech of this kind.

Aligning closely with Interspeech 2023’s theme, 'Inclusive Spoken Language Science and Technology – Breaking Down Barriers', we present the challenge of developing robust language identification and language diarization systems that are reliable for non-standard, accented, bilingual, child-directed speech collected via a videocall platform.

As videocalls become increasingly ubiquitous, we present a unique, first-of-its-kind Zoom videocall dataset. The MERLIon CCS Challenge will tackle automatic language identification and language diarization in a subset of audio recordings from the Talk Together Study, in which parents narrated an onscreen wordless picturebook to their child. The main objectives of this inaugural challenge are:

to benchmark current and novel language identification and language diarization systems in a code-switching scenario that includes extremely short utterances;
to test the robustness of such systems under accented speech; and
to inspire the research community to propose novel solutions in terms of adaptation, training, and novel embedding extraction for this particular set of tasks.

Techniques developed in the challenge may also benefit related fields, allowing a greater understanding of how code-switching occurs in real-life situations.

The challenge will feature language identification (Task 1) and language diarization (Task 2). Two tracks, open and closed, are available. The tracks differ by the data used during system training.

✨DID YOU KNOW?✨ With the body of a mermaid and the head of a lion, the Merlion is a national icon of Singapore. ✨Just as the Merlion is a mix of different creatures, the Singaporean code-switched child-directed speech in this challenge is a mix of different languages✨

Important Dates

All deadlines are Anywhere on Earth (AoE)!

Registrations Open: 18 Jan 2023 🎉

Registrations Close: 24 Feb 2023 🎉

Training Data Partitions Release: 25 Jan 2023 🎉

Evaluation Plan Release: 27 Jan 2023 🎉

Data Release (Development Set): 27 Jan 2023 🎉

Baseline System Release: 13 Feb 2023 🎉

Data Release (Evaluation Set): 16 Feb 2023 🎉

Leaderboard Active: 17 Feb 2023 🎉

Official Evaluation Closes (Leaderboard Freeze): 28 Feb 2023 (extended to 2 Mar 2023!)

INTERSPEECH Paper Submission Closes: 1 Mar 2023

System Description Submission: 2 Mar 2023

INTERSPEECH Paper Update Submission Closes: 8 Mar 2023

Leaderboard Reopens*: 10 Mar 2023

INTERSPEECH Acceptance: 17 May 2023

*After the end of the official challenge period, the leaderboard will reopen for teams who want to continue developing their systems prior to the Interspeech session (optional).

Evaluation Plan

MERLIon CCS Challenge Evaluation Plan (version 1.2) Updated 17 Feb 2023
MERLIon CCS Challenge Evaluation Plan (version 1.1)
MERLIon CCS Challenge Evaluation Plan (version 1.0)

Organizers

Leibny Paola Garcia Perera, Johns Hopkins University
YH Victoria Chua, Nanyang Technological University
Hexin Liu, Nanyang Technological University
Fei Ting Woon, Nanyang Technological University
Andy Khong, Nanyang Technological University
Justin Dauwels, TU Delft
Sanjeev Khudanpur, Johns Hopkins University
Suzy J Styles, Nanyang Technological University

Acknowledgements

We would like to thank the Linguistic Data Consortium for providing the Mandarin-English Codeswitching in Southeast Asia (LDC2015S04) corpus for the challenge.

Contact Us

For questions, please get in touch with Victoria at merlion.challenge@gmail.com.

Do join our mailing list or LinkedIn group for all challenge updates!