SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition


Desh Raj*, Daniel Povey‡, Sanjeev Khudanpur*†

* Center for Language and Speech Processing, Johns Hopkins University, Baltimore MD, USA

† Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore MD, USA

‡ Xiaomi Corp., Beijing, China

Abstract

The Streaming Unmixing and Recognition Transducer (SURT) model was recently proposed as an end-to-end approach for continuous, streaming, multi-talker automatic speech recognition (ASR). Despite impressive results on multi-turn meetings, SURT has notable limitations: (i) it suffers from errors related to leakage and omission; (ii) it is computationally expensive, which has hindered its adoption in academia; and (iii) it has only been evaluated on synthetic mixtures.

In this work, we propose several modifications to the original SURT model that are carefully designed to address these limitations. In particular, we (i) change the unmixing module to a mask estimator that uses dual-path modeling, (ii) use a streaming zipformer encoder and a stateless decoder for the transducer, (iii) perform mixture simulation using force-aligned subsegments, (iv) pre-train the transducer on single-speaker data, (v) use auxiliary objectives in the form of a masking loss and an encoder CTC loss, and (vi) perform domain adaptation for far-field recognition.

We show that these modifications allow SURT 2.0 to outperform its predecessor on multi-talker ASR, while being efficient enough to train with academic resources. We evaluate on three publicly available meeting benchmarks (LibriCSS, AMI, and ICSI), where our best model achieves WERs of 16.9%, 44.6%, and 32.2%, respectively, on far-field unsegmented recordings.

We release training recipes and pre-trained models: https://sites.google.com/view/surt2.

Model

Continuous, streaming, multi-talker ASR

The task is continuous, streaming, multi-talker ASR: the model receives long, unsegmented audio that may contain overlapping speech from multiple talkers, and it must emit transcripts in a streaming fashion, without access to utterance or speaker segmentation.


For now, we do not perform speaker attribution, i.e., the transcription is speaker-agnostic. The choice of evaluation metric depends on the type of model output; for speaker-agnostic transcription, we use the optimal reference combination WER (ORC-WER) metric, as implemented in the meeteval toolkit. ORC-WER assigns each reference utterance to one of the model's output channels such that the total WER over all channels is minimized.
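To make the metric concrete, here is a minimal, self-contained Python sketch of ORC-WER computed by brute force over all assignments of reference utterances to output channels. The meeteval implementation is far more efficient; the function names (`edit_distance`, `orc_wer`) and the toy example are only illustrative.

```python
from itertools import product


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion of a reference word
                dp[j - 1] + 1,    # insertion of a hypothesis word
                prev + (r != h),  # substitution (or match if equal)
            )
    return dp[-1]


def orc_wer(reference_utterances, hypothesis_channels):
    """Brute-force ORC-WER: try every assignment of reference utterances
    to hypothesis channels and keep the one with the fewest word errors."""
    num_channels = len(hypothesis_channels)
    total_ref_words = sum(len(u.split()) for u in reference_utterances)
    best_errors = None
    for assignment in product(range(num_channels), repeat=len(reference_utterances)):
        errors = 0
        for c in range(num_channels):
            # Concatenate the references assigned to this channel (in order)
            # and score them against the channel's hypothesis.
            ref = [w for u, a in zip(reference_utterances, assignment)
                   if a == c for w in u.split()]
            errors += edit_distance(ref, hypothesis_channels[c].split())
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / total_ref_words


# Toy example with two output channels:
refs = ["hello how are you", "i am fine", "what about you"]
hyps = ["hello how are you what about you", "i am fine"]
print(f"ORC-WER: {orc_wer(refs, hyps):.2%}")  # 0.00% for this toy example
```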



Streaming Unmixing and Recognition Transducer (SURT)

We use the Streaming Unmixing and Recognition Transducer (SURT) model for this task. The model builds on the original SURT work and the SURT 2.0 paper cited below.


The model combines a speech separation model and a speech recognition model, trained end-to-end with a single loss function. Note that this architecture differs slightly from the one in the original SURT papers; a detailed description of the model can be found in our paper.
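To illustrate the two-branch structure, the following is a minimal PyTorch sketch: a mask estimator produces one mask per output branch, the masked features pass through a shared encoder, and each branch's encoder output would then be scored by the transducer decoder and joiner. The `ToySurt` class, the LSTM mask estimator, and the layer sizes are placeholders for illustration, not the dual-path mask network or streaming zipformer used in the actual recipes.

```python
import torch
import torch.nn as nn


class ToySurt(nn.Module):
    """Two-branch SURT sketch: unmix with masks, then a shared encoder."""

    def __init__(self, feat_dim=80, enc_dim=256, num_branches=2):
        super().__init__()
        self.num_branches = num_branches
        # Mask estimator (placeholder for the dual-path mask network):
        # predicts one mask per branch over the input features.
        self.mask_lstm = nn.LSTM(feat_dim, enc_dim, batch_first=True)
        self.mask_proj = nn.Linear(enc_dim, num_branches * feat_dim)
        # Shared recognition encoder (placeholder for the streaming zipformer).
        self.encoder = nn.LSTM(feat_dim, enc_dim, batch_first=True)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) log-Mel filterbank features.
        B, T, F = feats.shape
        h, _ = self.mask_lstm(feats)
        masks = torch.sigmoid(self.mask_proj(h))        # (B, T, branches * F)
        masks = masks.view(B, T, self.num_branches, F)  # (B, T, branches, F)
        # Apply each mask to the input and stack branches along the batch
        # dimension, so the shared encoder processes all branches in one pass.
        masked = feats.unsqueeze(2) * masks             # (B, T, branches, F)
        masked = masked.permute(0, 2, 1, 3).reshape(B * self.num_branches, T, F)
        enc_out, _ = self.encoder(masked)               # (B * branches, T, enc_dim)
        # Each branch's encoder output would be fed to the transducer decoder
        # and joiner, and trained with a transducer loss per branch.
        return enc_out, masks


model = ToySurt()
x = torch.randn(4, 100, 80)        # 4 mixtures, 100 frames, 80-dim features
enc_out, masks = model(x)
print(enc_out.shape, masks.shape)  # (8, 100, 256) and (4, 100, 2, 80)
```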

Recipes

We provide icefall recipes for LibriCSS, AMI, and ICSI.

How to run the recipes?

Icefall provides end-to-end reproducible training and inference scripts. If you are not familiar with the toolkit, it may be useful to first read the instructions and installation guidelines available at: https://icefall.readthedocs.io/en/latest/.

Once you have set up Icefall on your training infrastructure, you can run training from within the `egs/libricss/SURT` or `egs/ami/SURT` directories, using the data preparation and training scripts provided there.

How to use the pre-trained models?

Pre-trained models are provided through HuggingFace; links to the model repositories are available on the project page (https://sites.google.com/view/surt2).

To use these models, download them into your Icefall `egs` directories (ensuring that the model checkpoint, BPE model, etc. end up at the paths the recipe expects), and rename the checkpoint to something like `epoch-99.pt`. You can then run decoding by passing the options `--epoch 99 --avg 1 --use-averaged-model False`.
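As a sketch of the download-and-rename step, assuming you use the `huggingface_hub` Python package, the snippet below downloads a model repository and copies its checkpoint to `epoch-99.pt`. The repository id, experiment directory, and file layout below are placeholders; substitute the values for the model you actually downloaded.

```python
import shutil
from pathlib import Path

from huggingface_hub import snapshot_download

# Hypothetical repository id and paths -- replace with the actual
# HuggingFace repo and your local icefall recipe layout.
repo_id = "your-username/icefall-surt-libricss"           # placeholder
exp_dir = Path("egs/libricss/SURT/dprnn_zipformer/exp")   # assumed experiment dir
exp_dir.mkdir(parents=True, exist_ok=True)

# Download the full model repository (checkpoint, BPE model, configs, ...).
local_repo = Path(snapshot_download(repo_id=repo_id))

# Copy the released checkpoint under a name that the decoding options
# `--epoch 99 --avg 1 --use-averaged-model False` will pick up.
checkpoint = next(local_repo.rglob("*.pt"))               # assumes a single .pt file
shutil.copy(checkpoint, exp_dir / "epoch-99.pt")

# The BPE model and other assets should likewise be copied to the paths
# that the decoding script expects (e.g., a data/lang_bpe_* directory).
```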

Results

We trained base (26.7 M params) and large (37.9 M params) SURT models, as described in the paper. The results with and without adaptation are shown in the tables below. 

LibriCSS

Anechoic

Replayed

AMI/ICSI

Citation

```
@article{Raj2023SURT2A,
  title={SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition},
  author={Desh Raj and Daniel Povey and Sanjeev Khudanpur},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2023},
  url={https://ieeexplore.ieee.org/document/10262308},
  doi={10.1109/TASLP.2023.3318398}
}
```