FAQs

Q: How do we get the MERLIon CCS dataset?

A: Sign up for the challenge! We will give you instructions on how to access our dataset after we receive your application.

Q: I signed up for the challenge but did not receive any information to access the MERLIon CCS dataset. What should I do?

A: You should receive an email from us within 24 hours of signing up. If you do not see any emails from us, please check your Junk or Spam inboxes. Alternatively, you may contact Victoria at merlion.challenge@gmail.com.


Q: Does my team need to participate in both tasks?

A: Only participation in the closed track of Task 1 (Language Identification) is compulsory. 

 

Q: Does my team need to participate in both the open and closed tracks for both tasks?

A: Only participation in the closed track of Task 1 (Language Identification) is compulsory. Any track in Task 2 (Language Diarization) is optional.


Q: Can I use pretrained models for the open tracks?

A: Pretrained models that are publicly available (e.g., WavLM and HuBERT) may be used for the open tracks of the challenge. The name and version of each pretrained model must be documented in the system description document. Failure to document the model and version number accurately will result in an invalid submission. Proprietary pretrained models are not allowed.


Q: Are there restrictions for the additional training data we might use for the open tracks?

A: We have open tracks for Tasks 1 and 2, which allow for additional training data.

Two guidelines: 

1) Only up to 100 hours of additional data are allowed, i.e., participants may use any amount (0 to 100%) of the closed-track datasets plus up to 100 hours of any publicly available or proprietary data.

2) All training data must be thoroughly documented in the system description document at the end of the challenge, including how custom partitions of larger corpora were selected. Failure to document training data accurately will result in an invalid submission.

Note that if a team uses pretrained models, the 100-hour limit on additional training data applies to the data used for finetuning the pretrained models.


Q: Can I use the MERLIon CCS Development dataset for training or finetuning?

A: In all tracks and tasks, you may use the MERLIon CCS development dataset for training or finetuning models. The MERLIon CCS development dataset does not count towards the 100-hour limit in the open tracks.


Q: Must we submit a paper to the Interspeech special session? 

A: Submission to the session is strongly encouraged but not compulsory for participation in the challenge. 


Q: Which programming languages can we use?

A: You are free to use any programming language you like. For system evaluation of Task 2 (Language Diarization), we will require you to submit the output decisions as a Rich Transcription Time Marked (RTTM) file. Please refer to our Evaluation Plan for more details.
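To illustrate what an RTTM file looks like, here is a minimal sketch that formats diarization segments as RTTM records. It follows the generic NIST RTTM layout (record type, file ID, channel, onset, duration, placeholder fields, and a label); the file ID, language labels, and field conventions shown here are illustrative assumptions, and the exact format expected by the challenge is specified in the Evaluation Plan.

```python
# Sketch: writing language diarization output as RTTM lines.
# Field order follows the generic NIST RTTM SPEAKER record:
# type, file ID, channel, onset (s), duration (s), then
# placeholder <NA> fields and a label. The exact conventions
# required for submission are defined in the Evaluation Plan.

def to_rttm_line(file_id, onset, duration, label, channel=1):
    """Format one segment as a SPEAKER-type RTTM record."""
    return (f"SPEAKER {file_id} {channel} {onset:.2f} "
            f"{duration:.2f} <NA> <NA> {label} <NA> <NA>")

# Hypothetical segments: (onset seconds, duration seconds, language)
segments = [(0.00, 3.25, "English"), (3.25, 1.80, "Mandarin")]

with open("output.rttm", "w") as f:
    for onset, dur, lang in segments:
        f.write(to_rttm_line("audio_0001", onset, dur, lang) + "\n")
```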


Q: Can our teams re-distribute the data in this challenge?

A: Please do not re-distribute the MERLIon CCS data beyond the challenge. The member of your team who registers will have to sign a data use agreement and will act as a data custodian for your team.

Beyond the challenge, plans for further redistribution and use of the dataset have yet to be confirmed. However, plans are underway to release parts of it as a public corpus, so please await our updates!


Q: In which format should our team submit output?

A: Results should be submitted via CodaLab. Details on how to format your output data for submission are outlined in the Evaluation Plan (section on Scoring and Appendix C).


Q: How do we submit our results?

A: For the evaluation, we will be using CodaLab for results submission. You will need to sign up for a CodaLab account. Please refer to Submission for instructions on how to sign up for an account and submit results for each respective track.

 

Q: Must we submit a system description?

A: All teams are required to submit a system description at the end of the evaluation period that describes their submissions in sufficient detail for a researcher to understand the approach, as well as the data and computational resources required to train and run the system.

Documentation of the training data used in the open tracks must be included, covering, among other things, the procedures for curating custom partitions of larger corpora. For more information, please refer to Appendix D of the Evaluation Plan.


Q: How can we upload our system description?

A: At the end of the evaluation period, the contact person of each team should email the system description to merlion.challenge@gmail.com with SYSTEM DESCRIPTION in the email subject. The format of the system description is outlined in Appendix D of the Evaluation Plan.


Q: My team participated in both open and closed tracks. Should we submit different system descriptions for each track?

A: Only one document submission is required. The system description document should clearly state which tracks the team participated in. If you used the same system and only changed the training regime between the open and closed tracks, the system description document should clearly reflect that. However, if different systems were used, you should submit two descriptions in a single document detailing the systems, the tracks they were used for, and their parameters. For more information, please refer to Appendix D of the Evaluation Plan.


Q: Why is the challenge called MERLIon? 

A: With the body of a mermaid and the head of a lion, the Merlion is a national icon of Singapore. Just as the Merlion is a mix of different creatures, the code-switched child-directed speech in this challenge is a mix of different languages.