Task 1: Language Identification


Description

The first task of the MERLIon CCS challenge is language identification. During development, systems are provided with audio recordings where ground-truth language labels have been annotated with timestamps. Each audio segment will have a unique language label, either English or Mandarin. During evaluation, audio recordings are provided with timestamps but no language labels. The datasets for development and evaluation are discrete monolingual audio segments from the Talk Together Study.

There are open and closed tracks for this task; each track places rules on what training data may be used. For more information, please see Datasets.

Participation in the closed track for Task 1 (Language Identification) is compulsory for all teams participating in the challenge.

Scoring

The target languages to be evaluated in the challenge are English and Mandarin. Other languages that may appear in the recordings will not be evaluated. The primary evaluation metric is the equal error rate, while the secondary evaluation metric is the balanced accuracy. For more information on the evaluation metrics and guidelines, please refer to the Evaluation Plan.

The scoring script for generating these metrics is available on the GitHub repository.
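As a rough illustration of the two metrics, here is a minimal sketch in Python, assuming per-segment confidence scores for the target language and hard predicted labels; the official scoring script on the GitHub repository is authoritative, and the function names below are illustrative only:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-acceptance rate equals
    the false-rejection rate. `scores` are target-language confidence
    scores; `labels` are 1 (target) or 0 (non-target)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fars, frrs = [], []
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))   # false acceptances
        frrs.append(np.mean(~accept[labels == 1]))  # false rejections
    # Return the average of FAR and FRR where they are closest.
    i = int(np.argmin(np.abs(np.array(fars) - np.array(frrs))))
    return (fars[i] + frrs[i]) / 2

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so each language counts equally
    regardless of how many segments it has."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))
```

For example, `balanced_accuracy(["English", "English", "Mandarin", "Mandarin"], ["English", "Mandarin", "Mandarin", "Mandarin"])` averages an English recall of 0.5 and a Mandarin recall of 1.0, giving 0.75.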

Submission

Results for all audio segments to be evaluated must be enclosed in a single .txt file named prediction.txt and placed in a zip folder whose name contains no spaces.

The submission must have the following structure:

results.zip/
└── prediction.txt
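As a sketch, the expected archive can be produced as follows; the empty prediction file here is a placeholder only, since the actual line format is defined in Appendix C of the Evaluation Plan:

```python
import zipfile
from pathlib import Path

pred = Path("prediction.txt")
if not pred.exists():
    # Placeholder only — real contents must follow the Appendix C format.
    pred.write_text("")

# The archive name must contain no spaces.
with zipfile.ZipFile("results.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(pred, arcname="prediction.txt")
```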

For more information on the format guidelines, please refer to Appendix C of the Evaluation Plan.

Results submission for the challenge will be on CodaLab. For more information, please refer to Submission.


Leaderboards