Integration of online/offline ASR models

Integration of online/offline ASR models (2023)

Time-synchronous one-pass Beam Search for Parallel Online and Offline Transducers with Dynamic Block Training

End-to-end automatic speech recognition (ASR) has become an increasingly popular area of research, with online and offline ASR as its two main model types.

Online models aim to provide real-time transcription with minimal latency, whereas offline models wait until the end of the speech utterance before generating a transcription.

In this work, we explore three techniques to maximize the performance of each model by

1) proposing a joint parallel online and offline architecture for transducers;

2) introducing dynamic block (DB) training, which allows flexible block size selection and improves the robustness of the offline mode; and,

3) proposing a novel time-synchronous one-pass beam search using the online and offline decoders to further improve the performance of the offline mode.
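The block size selection in dynamic block training (item 2) can be pictured with a minimal sketch. It assumes that one block size is drawn uniformly per training step from a fixed candidate set and that frames are split into non-overlapping blocks. The candidate sizes, the uniform sampling, and the helper names `sample_block_size` and `split_into_blocks` are illustrative assumptions, not details taken from the paper.

```python
import random


def sample_block_size(candidates, rng):
    """Pick one block size for the current training step.

    Assumption: uniform sampling over a fixed candidate set.
    """
    return rng.choice(candidates)


def split_into_blocks(frames, block_size):
    """Split a frame sequence into consecutive, non-overlapping blocks.

    The last block may be shorter than block_size. Real blockwise encoders
    usually add context frames around each block; that is omitted here.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [frames[i:i + block_size] for i in range(0, len(frames), block_size)]


rng = random.Random(0)
candidates = [5, 10, 20, 40]   # hypothetical block sizes, not from the paper
frames = list(range(50))       # stand-in for 50 acoustic feature frames

for step in range(3):
    block_size = sample_block_size(candidates, rng)
    blocks = split_into_blocks(frames, block_size)
    print(f"step={step} block_size={block_size} num_blocks={len(blocks)}")
```

Because the encoder sees a different block size at each step, it cannot specialize to a single size, which is one plausible reading of why such training helps at block sizes not fixed in advance.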

This figure illustrates the architecture of the proposed method, which includes blockwise and full-context encoders capable of handling both online and offline modes.

In the offline mode, the hidden state vectors from the two encoders are vertically stacked. The design rests on the idea that the blockwise encoder is better at extracting local features and the full-context encoder is better at extracting global features. Concatenating the hidden state vectors from both encoders, H_on and H_off, lets the offline mode use both local and global features. Multitask learning of the online and offline outputs can also improve the robustness of the blockwise encoder. A one-pass beam search is then used to further improve offline-mode performance.
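The concatenation step can be sketched as follows. The sketch assumes that both encoders emit one vector per frame at the same frame rate and that the vectors are joined along the feature axis; the text does not specify the axis, so this is an assumption. The function name `concat_hidden_states` and the toy values are hypothetical.

```python
def concat_hidden_states(h_on, h_off):
    """Join per-frame hidden vectors from the blockwise (online) encoder
    and the full-context (offline) encoder along the feature axis.

    h_on, h_off: lists of per-frame feature lists with equal frame counts.
    """
    if len(h_on) != len(h_off):
        raise ValueError("both encoders must produce the same number of frames")
    return [on_vec + off_vec for on_vec, off_vec in zip(h_on, h_off)]


# Toy example: 2 frames, 2 local features and 1 global feature per frame.
h_on = [[0.1, 0.2], [0.3, 0.4]]
h_off = [[0.9], [0.8]]
h_joint = concat_hidden_states(h_on, h_off)
print(len(h_joint), len(h_joint[0]))
```

In this reading, the online mode would consume only `h_on`, while the offline mode consumes the joined vectors.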

This table presents the experimental results. The proposed method was compared to other approaches such as separately trained online/offline transducers and a cascaded encoder.

The proposed method demonstrated the best performance in both online and offline modes. In addition, the one-pass beam search, which tightly combines the online and offline modes, yielded a greater performance improvement than two-pass rescoring.

The left figure shows the relationship between block size and CER during decoding in the online mode. The blue line represents the proposed model with DB training, while the red and green lines represent the separate online model and the proposed model trained with a block size of 20, respectively. The proposed model with DB training outperformed the baseline for all block sizes, with a tradeoff between block size and CER.

The right figure shows the relationship between block size and CER for the offline mode.

The proposed model trained with a block size of 20 slightly outperformed the separate model when the block size was 20. However, offline mode performance decreased for block sizes other than 20. The proposed DB training achieved equal or better CER than the separate model for all block sizes. Notably, the proposed DB training resulted in enhanced robustness of the offline mode, outperforming the baseline.

We examined the effect of the proposed one-pass beam search. This figure shows the relationship between the decoder weight µ and CER. Compared to the case with no decoder weight (µ = 0), the CER improved as µ increased, with the smallest CER at µ = 0.3. This improvement suggests that the decoder weight can play a significant role in enhancing the performance of the one-pass beam search.
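The role of the decoder weight can be illustrated with a small sketch. It assumes that the offline log-probability of a hypothesis is combined with the online decoder's log-probability scaled by µ, so that µ = 0 reduces to offline-only scoring. This exact combination rule, the function names, and all numbers are assumptions for illustration. The sketch only rescores two complete hypotheses and is not the time-synchronous beam search itself.

```python
def fused_score(offline_logp, online_logp, mu):
    """Combine hypothesis scores from the two decoders.

    Assumed form: offline log-prob plus mu times online log-prob.
    """
    return offline_logp + mu * online_logp


def pick_best(hypotheses, mu):
    """Return the hypothesis with the highest fused score."""
    return max(
        hypotheses,
        key=lambda h: fused_score(h["offline_logp"], h["online_logp"], mu),
    )


# Hypothetical scores: the offline decoder slightly prefers "a",
# while the online decoder strongly prefers "b".
hypotheses = [
    {"text": "a", "offline_logp": -1.0, "online_logp": -5.0},
    {"text": "b", "offline_logp": -1.2, "online_logp": -1.0},
]

for mu in (0.0, 0.3):
    best = pick_best(hypotheses, mu)
    print(f"mu={mu} best={best['text']}")
```

With these toy numbers, a nonzero µ lets the online decoder overturn a narrow offline preference, which is the kind of interaction the decoder weight controls.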

The next figure shows typical decoding results for the separate models and the proposed method. Incorrect decoding results are highlighted in red, and results that were worse in the offline mode than in the online mode are additionally shown in bold. The separate offline model can make more errors than the separate online model, whereas the proposed method benefits from both modes, allowing it to effectively correct errors in the offline mode.

International conference / Peer-reviewed conference paper

Y. Sudo, M. Shakeel, Y. Peng, and S. Watanabe, “Time-synchronous one-pass Beam Search for Parallel Online and Offline Transducers with Dynamic Block Training”, in Proc. INTERSPEECH, 2023, pp. 4479-4483.