GenSEC Challenge at IEEE SLT 2024
Text-based Generative Speech Error Correction with LLMs
What is Generative Speech Error Correction (GenSEC)?
GenSEC Task 1 Description
LLM for Post-ASR Correction
This task focuses on mapping from n-best hypotheses to the ground-truth speech transcription (H2T). The training set includes n-best hypotheses and AM scores from different pre-trained end-to-end ASR models. Participants may use embeddings from the first-pass acoustic or speech model to make the second-pass model multi-modal, either for hypotheses reranking or for direct ground-truth mapping. This challenge aims to open a connection between the speech community and second-pass rewriting based on large language models (LLMs).
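A minimal sketch of second-pass H2T correction with an instruction-tuned LLM is shown below. The model name and prompt format are illustrative assumptions, not the official baseline; see the Hypo2Trans repository linked under Related References for the reference code.

```python
# Sketch only: prompt an instruction-tuned LLM with an ASR n-best list and
# ask it to emit a corrected transcription. Model choice is an assumption.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def correct_from_nbest(hypotheses: list[str]) -> str:
    """Map an ASR n-best list to a single corrected transcription."""
    listed = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    prompt = (
        "Below is the n-best hypothesis list from a speech recognizer.\n"
        f"{listed}\n"
        "Report the most likely true transcription, correcting any errors:\n"
    )
    out = generator(prompt, max_new_tokens=64, do_sample=False)
    # The pipeline returns the prompt plus the continuation; strip the prompt.
    return out[0]["generated_text"][len(prompt):].strip()
```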
Dataset
Training Set: the HyPoradise training sets (316.8k pairs)
Development Sets: Librispeech-test-clean (2.6k pairs), WSJ-dev93 (503 pairs), and 50% of the pairs from the other test sets, 5.5k pairs in total.
We have also prepared a self-recorded, 25-hour unseen dev set for evaluation (502 pairs).
Evaluation Sets: Librispeech-test-other (2.9k pairs), WSJ-dev93 (333 pairs), and 50% of the pairs from the other test sets, 5.5k pairs in total.
We have also prepared a self-recorded, 25-hour unseen test set for evaluation (503 pairs).
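For orientation, a minimal loader sketch for HyPoradise-style JSON pairs follows. The field names "input" (n-best list) and "output" (ground-truth transcription) are assumptions based on the Hypo2Trans release; verify them against the files you download.

```python
# Hedged loader sketch for HyPoradise-style H2T pairs; field names assumed.
import json

def load_h2t_pairs(path: str):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Each entry pairs an n-best hypothesis list with its reference text.
    return [(entry["input"], entry["output"]) for entry in data]

pairs = load_h2t_pairs("train_subset.json")  # hypothetical file name
nbest, reference = pairs[0]
print(len(nbest), "hypotheses; reference:", reference)
```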
Closed Sets - We maintain a closed set to avoid data-leakage issues; it is scheduled for release upon registration.
Contact: hucky@nvidia.com
GenSEC Task 2 Description
Post-ASR Speaker Tagging Correction
Track-2 is a challenge track that aims to correct the speaker tagging (speaker labels) of ASR-generated transcripts that have been tagged by a speaker diarization system. Participants will submit speaker-error-corrected transcripts, and cpWER will be calculated for evaluation.
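To make the metric concrete, here is a simplified cpWER (concatenated minimum-permutation WER) sketch: each speaker's words are concatenated, and WER is computed under the speaker-label assignment that minimizes total errors. It assumes the reference and hypothesis contain the same number of speakers; use the official scoring tools in the baseline repository for submission-grade numbers.

```python
# Simplified cpWER sketch (equal speaker counts assumed; not the official scorer).
from itertools import permutations
import jiwer  # pip install jiwer

def cpwer(ref_by_spk: dict[str, str], hyp_by_spk: dict[str, str]) -> float:
    refs = list(ref_by_spk.values())
    hyps = list(hyp_by_spk.values())
    total_ref_words = sum(len(r.split()) for r in refs)
    best_errors = float("inf")
    # Try every assignment of hypothesis speakers to reference speakers.
    for perm in permutations(hyps):
        errors = 0
        for ref, hyp in zip(refs, perm):
            m = jiwer.process_words(ref, hyp)
            errors += m.substitutions + m.deletions + m.insertions
        best_errors = min(best_errors, errors)
    return best_errors / total_ref_words

# Swapped speaker tags score 0.0 once the best permutation is found.
ref = {"spk1": "hello there", "spk2": "how are you"}
hyp = {"spkA": "how are you", "spkB": "hello there"}
print(cpwer(ref, hyp))
```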
Task 2 Baseline:
Baseline, Rules and Submission Guideline: https://github.com/tango4j/llm_speaker_tagging
The baseline system is based on the system proposed in Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach.
Dataset
Access Task-2 Dataset on HuggingFace:
https://huggingface.co/datasets/GenSEC-LLM/SLT-Task2-Post-ASR-Speaker-Tagging
The Task 2 track provides only a development set and an evaluation set.
err_source_text: train 222 files, dev 13 files, eval 11 files
ref_annotated_text: train 222 files, dev 13 files, eval 11 files (not public, only available through the leaderboard)
Leaderboard
The result files `err_dev.hyp.seglst.json` and `err_eval.hyp.seglst.json` are automatically evaluated and added to the leaderboard; a sketch of the submission file format follows below.
Use your organization name and system name. Multiple submission trials are allowed.
https://huggingface.co/spaces/GenSEC-LLM/task2_speaker_tagging_leaderboard
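The sketch below shows one way to write a submission file. The field set follows the common SegLST (segment-wise long-form transcription) convention used by meeteval-style scoring; confirm the exact schema against the baseline repository linked above.

```python
# Hedged sketch of a SegLST-style submission file; field names assumed.
import json

corrected_segments = [
    {
        "session_id": "session_0",   # recording/session identifier
        "speaker": "speaker_1",      # corrected speaker tag
        "words": "hello how are you",
        "start_time": 0.0,           # seconds, if required by the scorer
        "end_time": 2.1,
    },
]

with open("err_dev.hyp.seglst.json", "w", encoding="utf-8") as f:
    json.dump(corrected_segments, f, indent=2)
```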
Technical Papers
Please submit a challenge paper through the [CMT system]. A minimum of 2 pages and a maximum of 6 pages is allowed.
For templates and detailed requirements, please visit https://2024.ieeeslt.org/paper_submission/
June 20, 2024 : Paper submission deadline
June 27, 2024: Paper update deadline
The transcripts of multiple multi-speaker datasets were anonymized and altered to construct the dev and eval sets of Track-2.
Contact: taejinp@nvidia.com
GenSEC Task 3 Description
Details & Registration: Post-ASR LLM-Based Speech Emotion Recognition
Emotion Dataset: The IEMOCAP dataset will be provided. Following previous work on IEMOCAP, we use four emotion classes: angry, happy (+excited), neutral, and sad. We will provide ASR transcripts from eleven ASR models as the data.
You may use additional datasets to train your model, but they must NOT include IEMOCAP (except for the portion we provide), because part of IEMOCAP serves as our evaluation data. If additional datasets are used, they must be clearly identified in the paper.
Baseline: We will provide a GPT-3.5 baseline model that takes Whisper-tiny transcripts as input.
Evaluation: We use unweighted four-class accuracy (number of correctly predicted samples / number of all samples). Participants are expected to perform emotion recognition with any automatic method, based on the provided ASR transcripts.
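A hedged sketch of this setup follows: classify a transcript into one of the four classes with a chat LLM, then score with unweighted accuracy. The prompt wording and client usage are illustrative assumptions, not the official GPT-3.5 baseline.

```python
# Illustrative sketch only: LLM emotion classification plus unweighted accuracy.
from openai import OpenAI

LABELS = ["angry", "happy", "neutral", "sad"]
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def predict_emotion(transcript: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": ("Classify the emotion of this ASR transcript as one "
                        f"of {LABELS}. Answer with one word.\n\n{transcript}"),
        }],
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "neutral"  # fallback on odd output

def unweighted_accuracy(preds: list[str], golds: list[str]) -> float:
    # Number of correctly predicted samples / number of all samples.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```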
Contact: yuanchao.li@ed.ac.uk , yuangong@mit.edu
System Paper Deadline in the official IEEE SLT proceeding
Authors must submit their system draft by June 20 (abstract) to be included in the official SLT proceedings. Please describe your system and algorithm before June 20 in a 2-to-6-page system draft, with up to 2 extra pages for references.
Paper Submission (System and Method)
June 20, 2024
(CMT Link)
Paper Update (PDF revision)
June 27, 2024
Paper Notification (Potential Revision for Evaluation)
August 30, 2024
Organizing Chair and Committee
Task 1 -- Post-ASR Generative Correction
Chair: Dr. Huck Yang, Sr. Research Scientist, Nvidia
Committee:
Yuchen Hu, Nanyang Technological University
Chen Chen, Nanyang Technological University
Yen-Ting Lin, National Taiwan University
Dr. Zhehuai Chen, Nvidia
Rao Ma, Cambridge University
Task 2 -- Post-ASR Speaker Tagging Correction
Chair: Dr. TaeJin Park, Sr. Research Scientist, Nvidia
Committee:
Kunal Dhawan, Nvidia
Dr. Krishna Puvvada, Nvidia
Task 3 -- Post-ASR Speech Emotion Recognition
Co-Chair:
Dr. Yuan Gong, MIT
Yuanchao Li, University of Edinburgh
Committee:
Prof. Shrikanth (Shri) Narayanan, USC
Technical Committee
Prof. Eng Siong Chng, Nanyang Technological University
Dr. Andreas Stolcke, University of California, Berkeley
Prof. Shinji Watanabe, CMU
Prof. Sabato Marco Siniscalchi, University of Palermo
Prof. Yu Tsao, Academia Sinica
Prof. Jun Du, University of Science and Technology of China
Prof. Chao Zhang, Tsinghua University
Dr. Boris Ginsburg, Nvidia
Dr. Kate Knill, Cambridge University
Prof. Peter Bell, University of Edinburgh
Dr. Catherine Lai, University of Edinburgh
Related References
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models, NeurIPS 2023
Related Task 1 Code: https://github.com/Hypotheses-Paradise/Hypo2Trans
Enhancing speaker diarization with large language models: A contextual beam search approach, ICASSP 2024
Related Task 2 Code: https://github.com/tango4j/llm_speaker_tagging
Listen, Think, and Understand, ICLR 2024
Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques
Related Task 3 Code: https://github.com/YuanGongND/llm_speech_emotion_challenge
Task 1. ASR-LM Correction: Multi-task LM for post-ASR and post-translation correction
Task 2. Speaker Tagging Correction: Post-ASR speaker tagging correction
Task 3. ASR-LLM SER: Post-ASR LLM-based speech emotion recognition