GenSEC Challenge at IEEE SLT 2024
Text-based Generative Speech Error Correction with LLMs
What is Generative Speech Error Correction (GenSEC)?
GenSEC Task 1 Description
LLM for Post-ASR Correction
This task focuses on mapping from n-best hypotheses to the ground-truth speech transcription (H2T). The training set includes n-best hypotheses and AM scores from different pre-trained end-to-end ASR models. Participants may use embeddings from the first-pass acoustic or speech model to make the second-pass model multi-modal, either for hypotheses reranking or for direct ground-truth mapping. This challenge aims to open a connection between the speech community and second-pass rewriting based on large language models (LLMs).
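A minimal sketch of second-pass H2T correction with an instruction-tuned LLM is shown below. The model name and prompt format are illustrative assumptions, not the official baseline; see the Hypo2Trans repository linked under Related References for the reference code.

```python
# Sketch only: prompt an instruction-tuned LLM with an ASR n-best list and
# ask it to emit a corrected transcription. Model choice is an assumption.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def correct_from_nbest(hypotheses: list[str]) -> str:
    """Map an ASR n-best list to a single corrected transcription."""
    listed = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    prompt = (
        "Below is the n-best hypothesis list from a speech recognizer.\n"
        f"{listed}\n"
        "Report the most likely true transcription, correcting any errors:\n"
    )
    out = generator(prompt, max_new_tokens=64, do_sample=False)
    # The pipeline returns the prompt plus the continuation; strip the prompt.
    return out[0]["generated_text"][len(prompt):].strip()
```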
Dataset
Training Set: the HyPoradise training sets (316.8k pairs)
Development Sets: Librispeech-test-clean (2.6k pairs), WSJ-dev93 (503 pairs), and 50% of the pairs from the other test sets, 5.5k pairs in total.
We have also prepared a self-recorded, 25-hour unseen dev set for evaluation (502 pairs).
Evaluation Sets: Librispeech-test-other (2.9k pairs), WSJ-dev93 (333 pairs), and 50% of the pairs from the other test sets, 5.5k pairs in total.
We have also prepared a self-recorded, 25-hour unseen test set for evaluation (503 pairs).
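For orientation, a minimal loader sketch for HyPoradise-style JSON pairs follows. The field names "input" (n-best list) and "output" (ground-truth transcription) are assumptions based on the Hypo2Trans release; verify them against the files you download.

```python
# Hedged loader sketch for HyPoradise-style H2T pairs; field names assumed.
import json

def load_h2t_pairs(path: str):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Each entry pairs an n-best hypothesis list with its reference text.
    return [(entry["input"], entry["output"]) for entry in data]

pairs = load_h2t_pairs("train_subset.json")  # hypothetical file name
nbest, reference = pairs[0]
print(len(nbest), "hypotheses; reference:", reference)
```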
Closed Sets - We maintain a closed set to avoid data-leakage issues; it is scheduled for release upon registration.
Contact: hucky@nvidia.com
GenSEC Task 2 Description
Post-ASR Speaker Tagging Correction
Track-2 is a challenge track that aims to correct the speaker tagging (speaker labels) of ASR-generated transcripts that have been tagged by a speaker diarization system. Participants will submit speaker-error-corrected transcripts, and cpWER will be calculated for evaluation.
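To make the metric concrete, here is a simplified cpWER (concatenated minimum-permutation WER) sketch: each speaker's words are concatenated, and WER is computed under the speaker-label assignment that minimizes total errors. It assumes the reference and hypothesis contain the same number of speakers; use the official scoring tools in the baseline repository for submission-grade numbers.

```python
# Simplified cpWER sketch (equal speaker counts assumed; not the official scorer).
from itertools import permutations
import jiwer  # pip install jiwer

def cpwer(ref_by_spk: dict[str, str], hyp_by_spk: dict[str, str]) -> float:
    refs = list(ref_by_spk.values())
    hyps = list(hyp_by_spk.values())
    total_ref_words = sum(len(r.split()) for r in refs)
    best_errors = float("inf")
    # Try every assignment of hypothesis speakers to reference speakers.
    for perm in permutations(hyps):
        errors = 0
        for ref, hyp in zip(refs, perm):
            m = jiwer.process_words(ref, hyp)
            errors += m.substitutions + m.deletions + m.insertions
        best_errors = min(best_errors, errors)
    return best_errors / total_ref_words

# Swapped speaker tags score 0.0 once the best permutation is found.
ref = {"spk1": "hello there", "spk2": "how are you"}
hyp = {"spkA": "how are you", "spkB": "hello there"}
print(cpwer(ref, hyp))
```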
Task 2 Baseline:
Baseline, Rules and Submission Guideline: https://github.com/tango4j/llm_speaker_tagging
The baseline system is based on the system proposed in Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach.
Dataset
Access Task-2 Dataset on HuggingFace:
https://huggingface.co/datasets/GenSEC-LLM/SLT-Task2-Post-ASR-Speaker-Tagging
The Task 2 track provides only a development set and an evaluation set.
err_source_text: train 222 files, dev 13 files, eval 11 files
ref_annotated_text: train 222 files, dev 13 files, eval 11 files (not public, only available through the leaderboard)
Leaderboard
The result files `err_dev.hyp.seglst.json` and `err_eval.hyp.seglst.json` are automatically evaluated and added to the leaderboard; a sketch of the submission file format follows below.
Use your organization name and system name. Multiple submission trials are allowed.
https://huggingface.co/spaces/GenSEC-LLM/task2_speaker_tagging_leaderboard
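The sketch below shows one way to write a submission file. The field set follows the common SegLST (segment-wise long-form transcription) convention used by meeteval-style scoring; confirm the exact schema against the baseline repository linked above.

```python
# Hedged sketch of a SegLST-style submission file; field names assumed.
import json

corrected_segments = [
    {
        "session_id": "session_0",   # recording/session identifier
        "speaker": "speaker_1",      # corrected speaker tag
        "words": "hello how are you",
        "start_time": 0.0,           # seconds, if required by the scorer
        "end_time": 2.1,
    },
]

with open("err_dev.hyp.seglst.json", "w", encoding="utf-8") as f:
    json.dump(corrected_segments, f, indent=2)
```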
Technical Papers
Please submit a challenge paper through the [CMT system]. A minimum of 2 pages and a maximum of 6 pages is allowed.
For templates and detailed requirements, please visit https://2024.ieeeslt.org/paper_submission/
June 20, 2024 : Paper submission deadline
June 27, 2024: Paper update deadline
The transcripts of multiple multi-speaker datasets were anonymized and altered to construct the dev and eval sets of Track-2.
Contact: taejinp@nvidia.com
GenSEC Task 3 Description
Details & Registration: Post-ASR LLM-Based Speech Emotion Recognition
Emotion Dataset: The IEMOCAP dataset will be provided. Following previous work on IEMOCAP, we use four emotion classes: angry, happy (+excited), neutral, and sad. We will provide ASR transcripts from eleven ASR models as the data.
You may use additional datasets to train your model, but they must NOT include IEMOCAP (except for the portion we provide), because part of IEMOCAP serves as our evaluation data. If additional datasets are used, they must be clearly identified in the paper.
Baseline: We will provide a GPT-3.5 baseline model that takes Whisper-tiny transcripts as input.
Evaluation: We use unweighted four-class accuracy (number of correctly predicted samples / number of all samples). Participants are expected to perform emotion recognition with any automatic method, based on the provided ASR transcripts.
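A hedged sketch of this setup follows: classify a transcript into one of the four classes with a chat LLM, then score with unweighted accuracy. The prompt wording and client usage are illustrative assumptions, not the official GPT-3.5 baseline.

```python
# Illustrative sketch only: LLM emotion classification plus unweighted accuracy.
from openai import OpenAI

LABELS = ["angry", "happy", "neutral", "sad"]
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def predict_emotion(transcript: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": ("Classify the emotion of this ASR transcript as one "
                        f"of {LABELS}. Answer with one word.\n\n{transcript}"),
        }],
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "neutral"  # fallback on odd output

def unweighted_accuracy(preds: list[str], golds: list[str]) -> float:
    # Number of correctly predicted samples / number of all samples.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```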
Contact: yuanchao.li@ed.ac.uk , yuangong@mit.edu
System Paper Deadline in the official IEEE SLT proceeding
Authors must submit their system draft by June 20 (abstract) to be included in the official SLT proceedings. Please describe your system and algorithm before June 20 in a 2-to-6-page system draft, with up to 2 extra pages for references.
Paper Submission (System and Method)
June 20, 2024
(CMT Link)
Paper Update (PDF revision)
June 27, 2024
Paper Notification (Potential Revision for Evaluation)
August 30, 2024
Organizing Chair and Committee
Task 1 -- Post-ASR Generative Correction
Chair: Dr. Huck Yang, Sr. Research Scientist, Nvidia
Committee:
Yuchen Hu, Nanyang Technological University
Chen Chen, Nanyang Technological University
Yen-Ting Lin, National Taiwan University
Dr. Zhehuai Chen, Nvidia
Rao Ma, Cambridge University
Task 2 -- Post-ASR Speaker Tagging Correction
Chair: Dr. TaeJin Park, Sr. Research Scientist, Nvidia
Committee:
Kunal Dhawan, Nvidia
Dr. Krishna Puvvada, Nvidia
Task 3 -- Post-ASR Speech Emotion Recognition
Co-Chair:
Dr. Yuan Gong, MIT
Yuanchao Li, University of Edinburgh
Committee:
Prof. Shrikanth (Shri) Narayanan, USC
Technical Committee
Prof. Eng Siong Chng, Nanyang Technological University
Dr. Andreas Stolcke, University of California, Berkeley
Prof. Shinji Watanabe, CMU
Prof. Sabato Marco Siniscalchi, University of Palermo
Prof. Yu Tsao, Academia Sinica
Prof. Jun Du, University of Science and Technology of China
Prof. Chao Zhang, Tsinghua University
Dr. Boris Ginsburg, Nvidia
Dr. Kate Knill, Cambridge University
Prof. Peter Bell, University of Edinburgh
Dr. Catherine Lai, University of Edinburgh
Related References
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models, NeurIPS 2023
Related Task 1 Code: https://github.com/Hypotheses-Paradise/Hypo2Trans
Enhancing speaker diarization with large language models: A contextual beam search approach, ICASSP 2024
Related Task 2 Code: https://github.com/tango4j/llm_speaker_tagging
Listen, Think, and Understand, ICLR 2024
Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques
Related Task 3 Code: https://github.com/YuanGongND/llm_speech_emotion_challenge
Task 1. ASR-LM Correction: Multi-task LM for post-ASR and post-translation correction
Task 2. Speaker Tagging Correction: Post-ASR speaker tagging correction
Task 3. ASR-LLM SER: Post-ASR LLM-based speech emotion recognition