Datasets

Training Data

Closed Tracks

For the closed tracks of both Tasks 1 and 2, participants are only allowed to use the following to train their systems:

100 hours of clean speech from LibriSpeech ASR corpus (partition can be downloaded here).
- Note: There are other mirrors available at https://www.openslr.org/12/. Only the dataset labeled as training set of 100 hours "clean" speech (i.e., train-clean-100.tar.gz) should be used.
100 hours of preselected partition from the National Speech Corpus (Partition can be downloaded here)
200 hours of preselected partition from AISHELL (filenames of preselected partition here)
Mandarin-English Codeswitching in Southeast Asia (LDC2015S04) Corpus (also known as SEAME corpus)
- This corpus was provided for the MERLIon CCS challenge courtesy of the Linguistic Data Consortium (LDC). Only registered participants of the challenge will be allowed access to the corpus. The corpus will be distributed by the LDC. To access this dataset, please sign the LDC data license agreement and send it to LDC directly (ldc@ldc.upenn.edu). Once LDC have verified the agreement, the team will receive further instructions for access to the dataset.
- Note: If a team includes groups from more than one organization, each organization on a single team must license the data. The data cannot be shared across different organizations even if they are on the same team.

For the closed track, training data beyond the above datasets and their preselected partitions are not allowed. However, the MERLIon CCS development dataset may be used for training.

Please note that all participants in the challenge should make at least one valid submission to the closed track of Task 1 (Language Identification).

Open Tracks

For the open tracks of both Task 1 and 2, participants are allowed to use all training data in the closed track listed above plus up to an additional 100 hours of any publicly available or proprietary data to train their systems.

All training data should be thoroughly documented in the system description document at the end of the challenge, including how custom partitions of larger corpora were selected. Failure to document training data accurately will result in an invalid submission.

For a list of suggested training corpora, please refer to Evaluation Plan.

Pretrained models that are publicly available (e.g., wavLM and HuBERT) may also be used for the open tracks of the challenge. The name and version of the pretrained models must be documented in the system description document. Failure to document model and the version number accurately will result in an invalid submission. Proprietary pretrained models are not allowed.

The MERLIon CCS development dataset does not count towards the 100 hours limit in open tracks. It may be used for training or finetuning.

Development and Evaluation Data

The development set and evaluation set are Challenge-ready partitions of a larger dataset from the Talk Together Study which examines parent-child interactions in multilingual Singapore, and will be provided by the organizers.

How to access?

The MERLIon CCS development and evaluation sets are now available! Please register for the challenge and we will send you further instructions for accessing the dataset.

You should receive an email from us within 24 hours of sign up. If you do not receive news from, please check your Junk or Spam inboxes. Alternatively, you may contact Victoria at merlion.challenge@gmail.com.

In this study, parents narrated an onscreen wordless picturebook to their child over Zoom videoconferencing software. The Challenge ready partitions contains 305 Zoom audio recordings of 112 parent-child pairs featuring over 25 hours of child-directed speech in English and over 5 hours of child-directed speech in Mandarin. 103 of the parent-child pairs were recorded at least twice at separate timepoints (i.e., 2 recordings per pair), with a maximum of three recordings per pair.

Almost all adult voices in the dataset are female parents of a child under the age of 5 as well as the researchers. Voices of male family members, grandparents and children also occur in the dataset.

As part of the Talk Together Study, each audio recording is manually annotated and checked by at least 2 multilingual trained transcribers using ELAN, using the in-house BELA transcription protocol.

In the BELA transcription protocol, a subdivision of an utterance (due to code switch to a different language) is known as a ‘Grain’. Each utterance or subdivision is then labelled with boundaries for non-linguistic communicative acts including vocal sounds (e.g., humming), non-vocal sounds (e.g., clapping). Onsets and offsets of different languages are marked (Figure 1).

Figure 1. Transcription is done in ELAN. Transcribers were given special instructions to place the start-stop boundaries carefully both at the level of speaker turn, taking note to include sounds at the edge of words (such as fricatives at the end), and at the level of language information, taking note to include all word boundaries in each respective language in the event of language change. For the purposes of this challenge, transcriptions and translations will not be provided.

Diversity in Recording Environments, Accents and Language Use

As the Zoom calls were conducted in the homes of participating families on a variety of internet-enabled personal electronic devices including laptops, tablets and mobile phones, environmental background noise varied widely during recordings. Recordings were far-field using internal microphones.

Adults in our dataset use the Singaporean variety of English and Mandarin Chinese, which features different pronunciation from other standard varieties of English (in US and UK) and Mandarin (Putonghua) respectively. The Singaporean varieties also feature some unique vocabulary and grammar.

The Challenge dataset includes frequent code-switching within and between utterances. Only 244 out of 305 recordings feature two languages (English and Mandarin). In recordings with both languages, the proportion of Mandarin spoken overall ranged from 0.85% to 80.7%. The utterances are short; on average 1.4 seconds for English and 1.2 seconds for Mandarin.

Development and Evaluation Partitions

The Challenge Dataset is divided into a Development dataset (for system development) and an Evaluation Dataset (for test). The Evaluation Dataset is selected to be a representative subset of the data, where features such as distribution of grain size for English and Mandarin (Figure 2) and ratio of English to Mandarin per recording (Figure 3) are controlled. Ground truth language labels and timestamps will be provided for the Development dataset but withheld for the Evaluation dataset. To reduce overfitting to individual parent-child pairs, the voices in the Evaluation dataset do not appear in the Development dataset.

Figure 2. Density plots of the lengths of English and Mandarin grain sizes (in ms) in the Development set (Solid line) and Evaluation set (Dashed line). The strip plots below the each density plot indicates the distribution of all grains over both Development set (N = 9983 Mandarin grains; 40287 English grains) and Evaluation set (N = 9766 Mandarin grains; 39473 English grains).

Figure 3. Density plots of the total lengths of English and Mandarin speech (in ms) in each file in the Development set (Solid line) and Evaluation set (Dashed line). The strip plots below the each density plot indicates the distribution of lengths in each audio file over both Development set (N = 151) and Evaluation set (N = 154).

Figure 4. Proportion of English and Mandarin speech (in ms) in each file in the Development set (Solid line) and Evaluation set (Dashed line). Each stacked proportion bar represents a file in the Development set (N = 151) (Left) and Evaluation set (N = 154) (Right).

For more information on the Development and Evaluation sets of the MERLIon CCS Challenge, please refer to our Evaluation Plan.

Citation

Chua, Y. H. Victoria; Garcia Perera, Leibny Paola; Khudanpur, Sanjeev; Khong, Andy W. H.; Dauwels, Justin; Woon, Fei Ting; Styles, Suzy J, 2023, "Development and Evaluation data for Multilingual Everyday Recordings - Language Identification on Code-Switched Child-Directed Speech (MERLIon CCS) Challenge", https://doi.org/10.21979/N9/ANXS8Z, DR-NTU (Data), DRAFT VERSION.

Page updated

Google Sites

Report abuse