Automatic Speech Recognition (Speech to text)
posted in 2024
(I) Objective: Fine-tune the Whisper small model using Chinese and English audio datasets to achieve performance levels comparable to the Whisper large-v3 model.
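As a rough illustration of the starting point, the sketch below loads the pretrained small checkpoint and runs one toy supervised step. The checkpoint name openai/whisper-small and the Hugging Face transformers API are assumptions here; the dummy waveform and transcript stand in for real training data, and the actual data pipeline and hyperparameters are not shown.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# "openai/whisper-small" is the standard Hugging Face checkpoint for Whisper small.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.train()

# One toy supervised step on dummy data, just to show the fine-tuning loss path.
waveform = torch.randn(16000).numpy()                      # 1 s of fake 16 kHz audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("a transcript would go here", return_tensors="pt").input_ids

outputs = model(input_features=inputs.input_features, labels=labels)
outputs.loss.backward()                                     # gradients for an optimizer step
```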
(II) Whisper Mechanism (Fig. 1):
First, the raw audio input is converted to a log-Mel spectrogram by the feature extractor.
Next, the encoder processes the spectrogram, generating a sequence of encoder hidden states.
Finally, the decoder autoregressively predicts text tokens, conditioning on both the previously generated tokens and the encoder hidden states.
(Fig. 1) Whisper model
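The same three stages can be traced with a few lines of code. This is a minimal sketch assuming the Hugging Face transformers implementation and the openai/whisper-small checkpoint; the random waveform is only a placeholder for real 16 kHz audio.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

waveform = torch.randn(16000 * 5).numpy()   # placeholder for 5 s of real 16 kHz audio

# 1. Feature extractor: raw waveform -> log-Mel spectrogram (padded to 30 s of frames).
input_features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features

# 2. Encoder: log-Mel spectrogram -> sequence of encoder hidden states.
encoder_out = model.model.encoder(input_features)
print(encoder_out.last_hidden_state.shape)   # (batch, frames, hidden_size)

# 3. Decoder: previous tokens + encoder hidden states -> logits for the next token,
#    applied autoregressively; only the very first prediction is shown here.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
logits = model(encoder_outputs=encoder_out, decoder_input_ids=decoder_input_ids).logits
first_token = logits[:, -1].argmax(dim=-1)   # greedy pick of the first text token
```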
(III) Speed up inference:
The Whisper decoder with past_kv_cache (Fig. 2): By caching the previous Keys and Values, we only need to compute the current Keys and Values for the new token.
(Fig. 3) shows the MACs (multiply-accumulate operations) of the Whisper decoder with and without past_kv_cache (light blue and purple backgrounds, respectively).
For the decoder's self-attention, with past_kv_cache only the Keys and Values of the new token are computed, whereas without past_kv_cache the decoder recomputes the Keys and Values of all tokens at every step, as shown in (Fig. 3, 4).
For the decoder's cross-attention, with past_kv_cache the Keys and Values are projected from the encoder hidden states only once, at time step 1, whereas without past_kv_cache the decoder recomputes them at every time step 1, 2, 3, ..., n, as shown in (Fig. 3, 4). A code sketch of decoding with past_kv_cache follows the figure captions below.
(Fig. 2) The Whisper decoder with past_kv_cache.
(Fig. 3) The MACs of the Whisper decoder.
(Fig. 4) The self-attention and cross-attention of the Whisper decoder.
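Below is a minimal sketch of two decoding steps with past_kv_cache, again assuming the Hugging Face transformers API and the openai/whisper-small checkpoint; at time step 2 only the newly generated token is fed in and the cached Keys and Values are reused.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.eval()

waveform = torch.randn(16000 * 5).numpy()   # placeholder for real 16 kHz audio
input_features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features
encoder_out = model.model.encoder(input_features)   # computed once, reused at every step

with torch.no_grad():
    # Time step 1: feed the start token; use_cache=True returns past_key_values,
    # i.e. the self-attention K/V computed so far plus the cross-attention K/V
    # projected once from the encoder hidden states.
    step1 = model(
        encoder_outputs=encoder_out,
        decoder_input_ids=torch.tensor([[model.config.decoder_start_token_id]]),
        use_cache=True,
    )
    next_token = step1.logits[:, -1:].argmax(dim=-1)

    # Time step 2: with past_key_values, only the NEW token is fed in; K/V are
    # computed just for it and the cached K/V are reused. Without the cache, the
    # whole token sequence would be fed again and all K/V recomputed.
    step2 = model(
        encoder_outputs=encoder_out,
        decoder_input_ids=next_token,
        past_key_values=step1.past_key_values,
        use_cache=True,
    )
```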
(IV) Dataset: Utilized a diverse and comprehensive set of publicly available datasets, including:
English Datasets: Common Voice, Fleurs, LibriSpeech, SPGISpeech, GigaSpeech.
Chinese Datasets: Common Voice, Aishell1, Aishell2, MAGICDATA Mandarin Chinese Read Speech Corpus, MagicData RAMC, Primewords Chinese Corpus Set 1, aidatatang_200zh, THCHS-30, TALCS, zhvoice, WeNetSpeech.
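A hedged sketch of how such corpora can be mixed with the Hugging Face datasets library is shown below; the Hub dataset IDs, configs, and column names are assumptions for illustration and may differ from the exact copies used in this project.

```python
from datasets import Audio, concatenate_datasets, load_dataset

# Dataset IDs below are illustrative; local copies or other versions may have been used.
common_voice_en = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train")
librispeech = load_dataset("librispeech_asr", "clean", split="train.100")

# Whisper's feature extractor expects 16 kHz audio, so resample every corpus.
common_voice_en = common_voice_en.cast_column("audio", Audio(sampling_rate=16_000))
librispeech = librispeech.cast_column("audio", Audio(sampling_rate=16_000))

# Keep only the columns shared across corpora before mixing them.
common_voice_en = common_voice_en.select_columns(["audio", "sentence"])
librispeech = librispeech.rename_column("text", "sentence").select_columns(["audio", "sentence"])

mixed_train = concatenate_datasets([common_voice_en, librispeech]).shuffle(seed=42)
```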
(V) Results:
Evaluation metrics are presented in Table 1, showing:
Character Error Rate (CER) for the Chinese test dataset.
Word Error Rate (WER) for the English test datasets.
Official Whisper models (tiny, base, small, medium, largeV2, largeV3) are highlighted with an orange background, while the fine-tuned small model is highlighted with a green background.
The fine-tuned Whisper small model achieves Character Error Rate (CER) and Word Error Rate (WER) comparable to, or even lower than, those of the official Whisper large-v3 model.
(Table 1) Character Error Rate (CER) and Word Error Rate (WER) are metrics where lower values indicate better performance.
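For reference, the sketch below computes WER and CER with the Hugging Face evaluate library; the reference/prediction pairs are toy examples, and the exact evaluation script behind Table 1 is not reproduced here.

```python
import evaluate

# "wer" and "cer" are the standard metric IDs in the Hugging Face evaluate library.
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Toy reference/prediction pairs, purely for illustration.
en_ref, en_hyp = ["the quick brown fox"], ["the quick brown box"]
zh_ref, zh_hyp = ["今天天气很好"], ["今天天氣很好"]

# WER: word-level edit distance divided by the number of reference words.
print("WER:", wer_metric.compute(predictions=en_hyp, references=en_ref))   # 1 of 4 words wrong

# CER: character-level edit distance divided by the number of reference characters.
print("CER:", cer_metric.compute(predictions=zh_hyp, references=zh_ref))   # 1 of 6 characters wrong
```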
The evaluation above uses greedy search (Fig. 5), which selects the token with the highest probability at each step.
(Fig. 5) Greedy search.
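In the transformers API, greedy search corresponds to generation with num_beams=1 and do_sample=False. The sketch below assumes the openai/whisper-small checkpoint and uses a dummy waveform in place of real audio.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.eval()

waveform = torch.randn(16000 * 5).numpy()   # placeholder; use a real 16 kHz recording here
input_features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features

# Greedy search: num_beams=1 and do_sample=False means the highest-probability
# token is picked at every decoding step.
predicted_ids = model.generate(input_features, num_beams=1, do_sample=False, max_new_tokens=64)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```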