Latest Updates (as of 28/5/2024):

The MERLIon CCS Challenge will remain open indefinitely to encourage model development. The development and evaluation sets are publicly available (https://doi.org/10.21979/N9/ANXS8Z).

When using the dataset, please cite:

For a detailed description of the MERLIon CCS dataset, check out our Interspeech 2023 paper:

The winning systems for the open and closed tracks were presented in the following papers during the Interspeech 2023 special session:

We also presented an analysis of common errors where submitted systems collectively struggle when performing language identification on complex speech:

Updates (as of 14/8/2023):

🎉 The MERLIon CCS Challenge has been accepted as a special session at Interspeech 2023 🎉 We are looking forward to seeing everyone!

The following papers related to the challenge have been accepted at Interspeech 2023:


The data archive for the MERLIon CCS dataset is now available:

About

The inaugural MERLIon CCS Challenge focuses on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous code-switched, child-directed speech collected via Zoom.  

Due to a bias towards standard speech varieties, non-standard, accented speech remains an ongoing challenge for automatic speech processing. Although existing work has explored automatic speech recognition and language diarization in code-switched speech corpora, these tasks remain difficult for natural in-the-wild speech containing more than one language, particularly when code-switching occurs in short language spans. Moreover, because child-directed speech has acoustic features that are difficult for automatic language identification and language diarization, speech processing systems often struggle with natural speech of this kind.

Aligning closely with Interspeech 2023’s theme, 'Inclusive Spoken Language Science and Technology – Breaking Down Barriers', we present the challenge of developing robust language identification and language diarization systems that are reliable for non-standard, accented, bilingual, child-directed speech collected via a videocall platform.

As videocalls become increasingly ubiquitous, we present a first-of-its-kind Zoom videocall dataset. The MERLIon CCS Challenge will tackle automatic language identification and language diarization in a subset of audio recordings from the Talk Together Study, in which parents narrated an onscreen wordless picturebook to their child. The main objectives of this inaugural challenge are:

Techniques developed in the challenge may benefit other related fields, allowing greater understanding of how code-switching occurs in real-life situations.

The challenge will feature language identification (Task 1) and language diarization (Task 2). Two tracks, open and closed, are available. The tracks differ by the data used during system training.
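To make the difference between the two tasks concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions and not the challenge's official input or output format: language identification can be thought of as assigning one language label to a given speech segment, while language diarization must also locate where each language occurs in time. The function name `frames_to_segments`, the fixed frame length, and the labels `"en"` and `"zh"` are hypothetical choices made for this example.

```python
from itertools import groupby


def frames_to_segments(frame_labels, frame_sec=0.5):
    """Merge runs of identical frame-level language labels into
    (start_sec, end_sec, label) spans, a diarization-style output.

    Assumption: every frame has the same duration, `frame_sec`.
    """
    segments = []
    index = 0  # position of the first frame in the current run
    for label, run in groupby(frame_labels):
        length = len(list(run))
        start = index * frame_sec
        end = (index + length) * frame_sec
        segments.append((start, end, label))
        index += length
    return segments


# Hypothetical frame-level predictions for one short recording.
frames = ["en", "en", "zh", "zh", "zh", "en"]
for start, end, label in frames_to_segments(frames):
    print(f"{start:.1f}s - {end:.1f}s: {label}")
```

In this picture, a language identification system would output a single label for each pre-cut segment, whereas a diarization system has to produce the time-stamped spans itself, which is harder when language spans are short.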

Register here!

DID YOU KNOW? ✨ With the body of a fish and the head of a lion, the Merlion is a national icon of Singapore. ✨ Just as the Merlion is a mix of different creatures, the Singaporean code-switched child-directed speech in this challenge is a mix of different languages ✨