Latest Updates (as of 28/5/2024):
The MERLIon CCS Challenge will remain open indefinitely to encourage model development. The development and evaluation sets are publicly available (https://doi.org/10.21979/N9/ANXS8Z).
When using the dataset, please cite:
Chua, Victoria Yi Han; Garcia Perera, Leibny Paola; Khudanpur, Sanjeev; Khong, Andy W. H.; Dauwels, Justin; Woon, Fei Ting; Styles, Suzy J, 2023, "Development and Evaluation data for Multilingual Everyday Recordings - Language Identification on Code-Switched Child-Directed Speech (MERLIon CCS) Challenge", https://doi.org/10.21979/N9/ANXS8Z, DR-NTU (Data), V1
For a detailed description of the MERLIon CCS dataset, check out our Interspeech 2023 paper:
Y. H. V. Chua, H. Liu, L. P. Garcia Perera, F. T. Woon, J. Wong, X. Zhang, S. Khudanpur, A. W. H. Khong, J. Dauwels, and S. J. Styles, MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization, doi: 10.21437/Interspeech.2023-1446.
The teams behind the winning systems for the open and closed tracks presented the following papers during the INTERSPEECH 2023 special session:
S. K. Gupta, S. Hiray, P. Kukde (2023). Spoken Language Identification System for English-Mandarin Code-Switching Child-Directed Speech, doi: 10.21437/Interspeech.2023-1335
M. Shahin, Z. Nan, V. Sethu, B. Ahmed (2023). Improving wav2vec2-based Spoken Language Identification by Learning Phonological Features, doi: 10.21437/Interspeech.2023-2533.
K. Praveen, B. Radhakrishnan, K. Sabu, A. Pandey, M. A. B. Shaik (2023). Language Identification Networks for Multilingual Everyday Recordings, doi: 10.21437/Interspeech.2023-2047.
We also presented an analysis of the common errors that submitted systems collectively make when performing language identification on complex speech:
S. J. Styles, Y. H. V. Chua, F. T. Woon, H. Liu, L. P. Garcia Perera, S. Khudanpur, A. W. H. Khong, and J. Dauwels (2023). Investigating model performance in language identification: beyond simple error statistics, doi: 10.21437/Interspeech.2023-1707
Updates (as of 14/8/2023):
🎉 The MERLIon CCS Challenge has been accepted as a special session at Interspeech 2023 🎉 We are looking forward to seeing everyone!
The following papers related to the challenge have been accepted at Interspeech 2023:
Y. H. V. Chua, H. Liu, L. P. Garcia Perera, F. T. Woon, J. Wong, X. Zhang, S. Khudanpur, A. W. H. Khong, J. Dauwels, and S. J. Styles, "MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization". http://arxiv.org/abs/2305.18881
S. J. Styles, Y. H. V. Chua, F. T. Woon, H. Liu, L. P. Garcia Perera, S. Khudanpur, A. W. H. Khong, and J. Dauwels, "Investigating model performance in language identification: beyond simple error statistics". http://arxiv.org/abs/2305.18925
The data archive for the MERLIon CCS dataset is now available:
Y. H. V. Chua, L. P. Garcia Perera, S. Khudanpur, A. W. H. Khong, J. Dauwels, F. T. Woon, and S. J. Styles, “Development and Evaluation data for Multilingual Everyday Recordings - Language Identification on Code-Switched Child-Directed Speech (MERLIon CCS) Challenge,” 2023, DR-NTU (Data), V1, https://doi.org/10.21979/N9/ANXS8Z
About
The inaugural MERLIon CCS Challenge focuses on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous code-switched, child-directed speech collected via Zoom.
Due to a bias towards standard speech varieties, non-standard, accented speech remains an ongoing challenge for automatic speech processing. Although existing works have explored automatic speech recognition and language diarization in code-switching speech corpora, those tasks are still challenging for natural in-the-wild speech containing more than one language, particularly when the code-switching occurs in short language spans. Moreover, as child-directed speech contains acoustic features difficult for automatic language identification and language diarization, speech processing systems often struggle with natural speech of this kind.
Aligning closely with Interspeech 2023’s theme, 'Inclusive Spoken Language Science and Technology – Breaking Down Barriers', we present the challenge of developing robust language identification and language diarization systems that are reliable for non-standard, accented, bilingual, child-directed speech collected via a videocall platform.
As videocalls become increasingly ubiquitous, we present a first-of-its-kind Zoom videocall dataset. The MERLIon CCS Challenge will tackle automatic language identification and language diarization in a subset of audio recordings from the Talk Together Study, in which parents narrated an onscreen wordless picturebook to their child. The main objectives of this inaugural challenge are:
to benchmark the current and novel language identification and language diarization systems in a code-switching scenario including extremely short utterances;
to test the robustness of such systems under accented speech;
to inspire the research community to propose novel solutions in terms of adaptation, training, and novel embedding extraction for this particular set of tasks.
Techniques developed in the challenge may benefit other related fields allowing greater understanding of how code-switching occurs in real-life situations.
The challenge will feature language identification (Task 1) and language diarization (Task 2). Two tracks, open and closed, are available. The tracks differ by the data used during system training.
✨DID YOU KNOW?✨ With the body of a mermaid and the head of a lion, the Merlion is a national icon of Singapore. ✨Just as the Merlion is a mix of different creatures, the Singaporean code-switched child-directed speech in this challenge is a mix of different languages✨
Important Dates
All deadlines are AoE (Anywhere on Earth)!
Registrations Open: 18 Jan 2023 🎉
Registrations Close: 24 Feb 2023 🎉
Training Data Partitions Release: 25 Jan 2023 🎉
Evaluation Plan Release: 27 Jan 2023 🎉
Data Release (Development Set): 27 Jan 2023 🎉
Baseline System Release: 13 Feb 2023 🎉
Data Release (Evaluation Set): 16 Feb 2023 🎉
Leaderboard Active: 17 Feb 2023 🎉
Official Evaluation Closes (Leaderboard Freeze): 28 Feb 2023 (extended to 2 Mar 2023!)
INTERSPEECH Paper Submission Closes: 1 Mar 2023
System Description Submission: 2 Mar 2023
INTERSPEECH Paper Update Submission Closes: 8 Mar 2023
Leaderboard Reopens*: 10 Mar 2023
INTERSPEECH Acceptance: 17 May 2023
*After the end of the official challenge period, the leaderboard will reopen for teams who want to continue developing their systems prior to the Interspeech session (optional).
Organizers
Leibny Paola Garcia Perera, Johns Hopkins University
YH Victoria Chua, Nanyang Technological University
Hexin Liu, Nanyang Technological University
Fei Ting Woon, Nanyang Technological University
Andy Khong, Nanyang Technological University
Justin Dauwels, TU Delft
Sanjeev Khudanpur, Johns Hopkins University
Suzy J Styles, Nanyang Technological University
Acknowledgements
We would like to thank the Linguistic Data Consortium for providing the Mandarin-English Codeswitching in Southeast Asia (LDC2015S04) corpus for the challenge.
Contact Us
For questions, please get in touch with Victoria at merlion.challenge@gmail.com.
Do join our mailing list or LinkedIn group for all challenge updates!