Customization of Speech Recognition Models (2023)

Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation 

End-to-end automatic speech recognition (E2E-ASR) has the potential to improve performance, but it has difficulty handling enharmonic words: named entities (NEs) with the same pronunciation and part of speech that are spelled differently. This often occurs with Japanese personal names that share a pronunciation but are written with different Kanji characters. Since such NE words tend to be important keywords, an ASR system easily loses user trust if it misrecognizes them. To solve this problem, this paper proposes a novel retraining-free customization method for E2E-ASR based on a named-entity-aware E2E-ASR model and phoneme similarity estimation.

This figure shows an overview of the proposed method, which consists of a named-entity-aware ASR (NEA-ASR) model, a dictionary, a phoneme similarity estimator, and an error correction step.

The dictionary contains the token and phoneme sequences of the target named entities provided by the end user.

First, the NEA-ASR model produces the normal ASR result and, in addition, marks the target NE words in it with special tag tokens. It simultaneously outputs the phoneme sequence of each tagged NE word.

Then, the phoneme similarity estimator computes the similarity between the phoneme sequence of each tagged word and the phoneme sequences in the dictionary.

If the phoneme similarity exceeds a certain threshold, the tagged token sequence is replaced by the one in the dictionary.
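The correction step described above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the source does not say how the phoneme similarity is estimated, so the sketch substitutes one minus the normalized edit distance. The names `DictEntry`, `phoneme_similarity`, and `correct` are hypothetical. The comparison uses "greater than or equal to" so that a threshold of 1.0 accepts exact matches and a threshold of 0.0 accepts everything, which matches the threshold behaviour reported further down.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DictEntry:
    """One user-registered named entity (hypothetical structure)."""
    tokens: str                # spelling the user wants in the transcript
    phonemes: Sequence[str]    # phoneme sequence of that spelling


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phoneme_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """ASSUMED measure: 1 - normalized edit distance, in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def correct(tagged_tokens: str, tagged_phonemes: Sequence[str],
            dictionary: Sequence[DictEntry], vth: float = 0.5) -> str:
    """Replace the tagged tokens by the best dictionary entry if it reaches vth."""
    if not dictionary:
        return tagged_tokens
    best = max(dictionary,
               key=lambda e: phoneme_similarity(tagged_phonemes, e.phonemes))
    if phoneme_similarity(tagged_phonemes, best.phonemes) >= vth:
        return best.tokens
    return tagged_tokens


# Toy example: the user registered one spelling of "Satou".
dictionary = [DictEntry("佐藤", ["s", "a", "t", "o", "u"])]
# Misspelled name with a slightly wrong phoneme estimate: similarity 0.8.
print(correct("左藤", ["s", "a", "t", "o"], dictionary))            # 佐藤
# Unrelated name: similarity is below 0.5, so it is left unchanged.
print(correct("田中", ["t", "a", "n", "a", "k", "a"], dictionary))  # 田中
```

Because the replacement only consults the dictionary at recognition time, adding a new name means adding a new entry; the ASR model itself is not retrained.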

This table compares the proposed method with the baseline, a standard CTC/attention model.

The proposed method outperformed the baseline in both overall character error rate (CER) and CER-NE.
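For reference, CER is conventionally the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The sketch below computes it over a small corpus; the function names are hypothetical. The source does not define CER-NE, so the sketch assumes it is the same ratio computed only over the named-entity portions, in which case the same function would be called on the NE strings alone.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(refs: list, hyps: list) -> float:
    """Total edit errors divided by total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return errors / chars


# One wrong Kanji in a four-character reference.
print(cer(["佐藤さん"], ["左藤さん"]))  # 0.25
```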


The figure on the right shows the results of the proposed method before and after error correction. The proposed NEA-ASR model tagged not only in-vocabulary words but also out-of-vocabulary (OOV) words, with a total of 88.3% of the personal names being tagged.

Before error correction, 13% of the in-vocabulary personal names and 90% of the OOV personal names had substitution errors; after error correction these fell to 5% and 15%, respectively (shown in orange and pink in the figure).

We also tested the effect of dictionary size on CER-NE, as shown in this figure.

Without a dictionary, the CER-NE was 46.5%; it improved significantly once the appropriate personal names were added to the dictionary.

As the dictionary grew, the CER-NE gradually rose (that is, worsened), but even with 1,000 personal names registered in the dictionary, it remained better than the baseline.

We also tested the effect of the phoneme similarity threshold, Vth, on CER-NE, as shown in this table.

When Vth = 1.0, personal names are replaced only when the estimated phoneme sequence exactly matches a phoneme sequence in the dictionary, so even a small error in phoneme estimation blocks the correction and degrades performance.

Conversely, when Vth = 0.0, all tagged words are replaced even if the estimated phoneme sequence is far from any phoneme sequence in the dictionary, which increases replacement errors.

Setting the threshold Vth to 0.5 gave the best CER-NE because it avoided both of these problems.
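The two failure modes can be illustrated with a toy calculation. The similarity values below are made up for illustration and are not results from the paper; `count_errors` is a hypothetical helper. Each case pairs an estimated similarity with whether the tagged word really should be replaced by the dictionary entry.

```python
def count_errors(cases, vth):
    """Return (missed corrections, wrong replacements) at threshold vth."""
    missed = sum(1 for sim, want in cases if want and sim < vth)
    wrong = sum(1 for sim, want in cases if not want and sim >= vth)
    return missed, wrong


# Illustrative values only (not measured data).
cases = [
    (1.0, True),   # exact phoneme match with the dictionary entry
    (0.8, True),   # correct entry, small phoneme estimation error
    (0.2, False),  # tagged name that is not in the dictionary
]

for vth in (1.0, 0.5, 0.0):
    print(vth, count_errors(cases, vth))
# 1.0 (1, 0)  the slightly mis-estimated name is no longer corrected
# 0.5 (0, 0)  both problems avoided on this toy set
# 0.0 (0, 1)  the unrelated name is wrongly replaced
```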

International conference / Peer-reviewed conference paper