SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition
Desh Raj¹, Daniel Povey³, Sanjeev Khudanpur¹²
¹ Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
² Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA
³ Xiaomi Corp., Beijing, China
Abstract
The Streaming Unmixing and Recognition Transducer (SURT) model was recently proposed as an end-to-end approach for continuous, streaming, multi-talker automatic speech recognition (ASR). Despite impressive results on multi-turn meetings, SURT has notable limitations: (i) it suffers from leakage- and omission-related errors; (ii) it is computationally expensive, which has prevented its adoption in academia; and (iii) it has only been evaluated on synthetic mixtures.
In this work, we propose several modifications to the original SURT, carefully designed to address these limitations. In particular, we (i) replace the unmixing module with a mask estimator that uses dual-path modeling, (ii) use a streaming zipformer encoder and a stateless decoder for the transducer, (iii) simulate training mixtures using force-aligned subsegments, (iv) pre-train the transducer on single-speaker data, (v) add auxiliary objectives in the form of a masking loss and an encoder CTC loss, and (vi) perform domain adaptation for far-field recognition.
We show that these modifications allow SURT 2.0 to outperform its predecessor on multi-talker ASR, while being efficient enough to train with academic resources. We conduct our evaluations on three publicly available meeting benchmarks: LibriCSS, AMI, and ICSI, on which our best model achieves WERs of 16.9%, 44.6%, and 32.2%, respectively, on far-field unsegmented recordings.
We release training recipes and pre-trained models: https://sites.google.com/view/surt2.
Model
Continuous, streaming, multi-talker ASR
We address the task of continuous, streaming, multi-talker ASR.
By "continuous", we mean that the model should transcribe unsegmented audio without the need for an external voice activity detector (VAD).
By "streaming", we mean that the model has limited right context; we use a right context of at most 32 frames (320 ms).
By "multi-talker", we mean that the model should transcribe overlapping speech from multiple speakers.
For now, we do not perform speaker attribution, i.e., the transcription is speaker-agnostic. The choice of evaluation metric depends on the type of output the model produces; for speaker-agnostic transcription, we use the optimal reference combination WER (ORC-WER) metric, as implemented in the meeteval toolkit.
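As a toy illustration of ORC-WER, consider the sketch below. It assumes the `orc_word_error_rate` function from meeteval's Python API; argument formats vary across versions, so consult the toolkit's documentation before relying on it.

```python
# Toy sketch of speaker-agnostic scoring with ORC-WER via the meeteval
# toolkit (https://github.com/fgnt/meeteval). Assumes the Python API exposes
# orc_word_error_rate(reference, hypothesis); check your installed version.
from meeteval.wer import orc_word_error_rate

# Reference: one entry per utterance, in temporal order.
reference = ["how are you", "quite well", "thanks for asking"]
# Hypothesis: one entry per output channel of the multi-talker model.
hypothesis = ["how are you thanks for asking", "quite well"]

# ORC-WER assigns each reference utterance to the output channel that
# minimizes the overall WER, preserving utterance order within a channel:
# here utterances 1 and 3 map to channel 1, utterance 2 to channel 2.
result = orc_word_error_rate(reference, hypothesis)
print(result.error_rate)  # 0.0 for this perfectly matched toy example
```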
References:
Raj, D., Lu, L., Chen, Z., Gaur, Y., & Li, J. (2022). Continuous Streaming Multi-Talker ASR with Dual-Path Transducers. In Proc. IEEE ICASSP, pp. 7317-7321.
Sklyar, I., Piunova, A., Zheng, X., & Liu, Y. (2022). Multi-Turn RNN-T for Streaming Recognition of Multi-Party Speech. In Proc. IEEE ICASSP, pp. 8402-8406.
von Neumann, T., Boeddeker, C., Kinoshita, K., Delcroix, M., & Haeb-Umbach, R. (2023). On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems. In Proc. IEEE ICASSP.
Streaming Unmixing and Recognition Transducer (SURT)
We use the Streaming Unmixing and Recognition Transducer (SURT) model for this task. The model is based on the following papers:
Lu, L., et al. (2021). Streaming End-to-End Multi-Talker Speech Recognition. IEEE Signal Processing Letters, 28, 803-807.
Raj, D., Lu, L., Chen, Z., Gaur, Y., & Li, J. (2022). Continuous Streaming Multi-Talker ASR with Dual-Path Transducers. In Proc. IEEE ICASSP, pp. 7317-7321.
The model combines a speech separation model and a speech recognition model, trained end-to-end with a single loss function. Note that the architecture differs slightly from the one in the papers above; a detailed description of the model can be found in our paper.
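To make the branch-masking idea concrete, below is a minimal, runnable sketch of the forward pass. It is not the icefall implementation: the LSTM stand-ins for the dual-path mask estimator and the zipformer transducer encoder, as well as all dimensions, are illustrative assumptions; only the overall structure follows the model described above.

```python
# Minimal sketch of the SURT forward pass: a mask estimator splits the
# mixture into branches, and a shared recognizer processes each branch.
# All module internals here are illustrative stand-ins, not icefall code.
import torch
import torch.nn as nn


class MaskEstimator(nn.Module):
    """Stand-in for the dual-path mask estimator: one mask per branch."""

    def __init__(self, feat_dim: int, num_branches: int):
        super().__init__()
        self.num_branches = num_branches
        self.net = nn.LSTM(feat_dim, 128, batch_first=True)
        self.proj = nn.Linear(128, num_branches * feat_dim)

    def forward(self, feats):  # feats: (batch, time, feat_dim)
        h, _ = self.net(feats)
        masks = torch.sigmoid(self.proj(h))  # (batch, time, branches * feat_dim)
        b, t, _ = masks.shape
        return masks.view(b, t, self.num_branches, -1).permute(0, 2, 1, 3)


class SURTSketch(nn.Module):
    """Masks the mixture into branches; a shared encoder (stand-in for the
    zipformer transducer encoder) transcribes each branch independently."""

    def __init__(self, feat_dim: int = 80, num_branches: int = 2):
        super().__init__()
        self.mask_estimator = MaskEstimator(feat_dim, num_branches)
        self.encoder = nn.LSTM(feat_dim, 256, batch_first=True)

    def forward(self, feats):
        masks = self.mask_estimator(feats)  # (batch, branch, time, feat_dim)
        outputs = []
        for b in range(masks.shape[1]):
            branch_feats = feats * masks[:, b]   # "unmix" by masking
            enc, _ = self.encoder(branch_feats)  # shared weights across branches
            outputs.append(enc)
        # In the real model, each branch's encoder output feeds the transducer
        # decoder and joiner, and branches are matched to reference transcripts
        # with heuristic error assignment training (HEAT).
        return outputs


feats = torch.randn(4, 100, 80)  # (batch, frames, features)
outs = SURTSketch()(feats)
print(len(outs), outs[0].shape)  # 2 branches, each (4, 100, 256)
```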
Recipes
We provide icefall recipes for LibriCSS, AMI, and ICSI.
How to run the recipes?
icefall provides end-to-end reproducible training and inference scripts. If you are not familiar with the toolkit, it may be useful to first read the instructions and installation guidelines at https://icefall.readthedocs.io/en/latest/.
Once you have set up icefall on your training infrastructure, you can run training from within the `egs/libricss/SURT` or `egs/ami/SURT` directory as follows.
First, run `prepare.sh` to prepare the training and evaluation data. It is recommended to run its stages one by one (as in Kaldi).
Then, for training and decoding, follow the commands provided in the corresponding README.md files in the recipes.
How to use the pre-trained models?
Pre-trained models are provided through HuggingFace:
LibriCSS: https://huggingface.co/desh2608/icefall-surt-libricss-dprnn-zipformer
AMI/ICSI: https://huggingface.co/desh2608/icefall-surt-ami-dprnn-zipformer
To use these models, download them into your icefall egs directory (making sure that model checkpoints, BPE models, etc. end up at the correct paths), and rename the checkpoint to something like `epoch-99.pt`. You can then run decoding by simply passing the options `--epoch 99 --avg 1 --use-averaged-model False`.
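As an illustration, the snippet below fetches the LibriCSS model with the huggingface_hub library and copies a checkpoint into an icefall experiment directory under the name `epoch-99.pt`. The checkpoint filename inside the repository and the `dprnn_zipformer/exp` layout are assumptions; inspect the downloaded files and the recipe's README.md to confirm the paths.

```python
# Sketch of downloading and staging a pre-trained SURT model.
# The repo id is real; the checkpoint glob and exp-dir layout are assumptions.
import shutil
from pathlib import Path

from huggingface_hub import snapshot_download

repo_dir = Path(snapshot_download("desh2608/icefall-surt-libricss-dprnn-zipformer"))

# Conventional icefall experiment directory (adjust to your recipe layout).
exp_dir = Path("egs/libricss/SURT/dprnn_zipformer/exp")
exp_dir.mkdir(parents=True, exist_ok=True)

# Rename whichever checkpoint the repo ships to epoch-99.pt, so that decoding
# can be run with: --epoch 99 --avg 1 --use-averaged-model False
ckpt = next(repo_dir.glob("**/*.pt"))  # assumes at least one .pt checkpoint
shutil.copy(ckpt, exp_dir / "epoch-99.pt")
print(f"Staged {ckpt.name} as {exp_dir / 'epoch-99.pt'}")
```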
Results
We trained base (26.7M parameters) and large (37.9M parameters) SURT models, as described in the paper. Results with and without adaptation, for LibriCSS (anechoic and replayed conditions) and for AMI/ICSI, are tabulated in the paper.
Citation
@article{Raj2023SURT2A,
title={SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition},
author={Desh Raj and Daniel Povey and Sanjeev Khudanpur},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year={2023},
url={https://ieeexplore.ieee.org/document/10262308},
doi={10.1109/TASLP.2023.3318398}
}