Speech-to-Text: SenseVoice
posted in 2025
(I) Objective: To build a complete fine-tuning pipeline that adapts SenseVoice to any specific domain.
The pipeline includes:
Automatically generating training audio data
Verifying the correctness of generated audios
Converting data into the SenseVoice training and validation formats (a conversion sketch follows this list)
Fine-tuning the SenseVoice model
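For the conversion step, a small script is enough to turn the verified (audio, transcript) pairs into training and validation manifests. The sketch below writes a jsonl manifest in the layout used by FunASR-style fine-tuning recipes; the exact field names (key, source, source_len, target, target_len) and the length conventions are assumptions and should be checked against the FunASR version you use.

```python
# Minimal sketch: convert verified (wav, transcript) pairs into a jsonl
# manifest for FunASR-style fine-tuning. Field names and length conventions
# are assumptions; verify them against your FunASR/SenseVoice recipe.
import json
import wave
from pathlib import Path

def write_manifest(pairs: list[tuple[str, str]], out_path: Path) -> None:
    """pairs: list of (wav_path, transcript) that passed STT verification."""
    with out_path.open("w", encoding="utf-8") as f:
        for wav_path, text in pairs:
            with wave.open(wav_path, "rb") as w:
                # Clip duration in milliseconds, used here as a rough length field.
                duration_ms = int(w.getnframes() / w.getframerate() * 1000)
            record = {
                "key": Path(wav_path).stem,
                "source": wav_path,
                "source_len": duration_ms,
                "target": text,
                "target_len": len(text),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```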
(II) Collect text data: Collected medical documents and articles from both public websites and doctors associated with our clients.
(III) Generate data:
Utilized multiple Text-to-Speech (TTS) models (including Higgs, IndexTTS2, and Coqui), selecting the most accurate ones to enhance the diversity of the training dataset.
Generated speech data covered a wide range of speakers, accents, acoustic environments, background noises, emotions, spatial effects, and contextual variations (see Fig. 1).
To handle difficult word pronunciations in TTS models, I built a pinyin-based substitution dictionary that replaces hard-to-pronounce words with easier-to-pronounce equivalents (sketched after Fig. 1).
To verify the correctness of the generated audio, I used an ensemble of STT models (Whisper-Large-v3 and Qwen3-Omni) for cross-validation (see the filter sketch after Fig. 1).
(Fig. 1) The generated speech data covers a wide range of scenarios.
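The substitution dictionary maps a term the TTS models mispronounce to a stand-in that shares (nearly) the same pinyin, so the synthesized audio can still be paired with the original transcript. A minimal sketch, with purely illustrative dictionary entries:

```python
# Minimal sketch of a pinyin-based substitution pass applied to text before
# it is sent to a TTS model. Dictionary entries are illustrative only; the
# real mapping was curated per domain for words the TTS models mispronounced.
import re

# hard-to-pronounce term -> easier stand-in intended to share the same
# (or very close) pinyin, so the audio still matches the intended word.
SUBSTITUTIONS = {
    "癥瘕": "征假",   # illustrative entry, both read roughly "zheng jia"
    "瘰疬": "裸历",   # illustrative entry, both read roughly "luo li"
}

_PATTERN = re.compile(
    "|".join(map(re.escape, sorted(SUBSTITUTIONS, key=len, reverse=True)))
)

def substitute_for_tts(text: str) -> str:
    """Replace hard-to-pronounce words before synthesis; in this sketch the
    original text is assumed to be kept as the training transcript."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0)], text)
```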
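The cross-validation itself can be a simple filter: transcribe each generated clip with both models and keep it only when every transcript is close enough to the source text. In the sketch below the two recognizers are abstract callables (loading Whisper-Large-v3 and Qwen3-Omni is omitted), and the 0.9 similarity threshold is an illustrative value rather than the one used in the project.

```python
# Minimal sketch of the STT cross-validation filter. The recognizers are
# passed in as callables (wav path -> transcript); model loading is omitted.
from difflib import SequenceMatcher
from typing import Callable

def _similarity(a: str, b: str) -> float:
    # Character-level similarity; punctuation/whitespace normalization
    # would typically be applied first for Chinese text.
    return SequenceMatcher(None, a, b).ratio()

def is_audio_valid(
    wav_path: str,
    reference_text: str,
    recognizers: list[Callable[[str], str]],
    threshold: float = 0.9,
) -> bool:
    """Keep a generated clip only if every STT model's transcript is close
    enough to the text the clip was synthesized from."""
    return all(
        _similarity(stt(wav_path), reference_text) >= threshold
        for stt in recognizers
    )
```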
(IV) STT Post-Processing: For speech with unclear or inaccurate pronunciation, I applied pinyin-distance matching combined with LLM-based correction to refine the transcripts and improve accuracy.
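One way to implement the pinyin-distance step is to convert both the STT output and a list of known domain terms to pinyin, snap near-matches onto the correct term, and then hand the result to an LLM for a final pass. The sketch below uses pypinyin and a plain edit distance over pinyin syllables; the term list and distance threshold are illustrative, and the LLM correction is represented only by an abstract callable.

```python
# Minimal sketch of pinyin-distance correction: spans of the transcript whose
# pinyin is close to a known domain term are replaced by that term. Term list
# and threshold are illustrative; the LLM step is an abstract callable.
from typing import Callable
from pypinyin import lazy_pinyin

DOMAIN_TERMS = ["心肌梗死", "冠状动脉"]  # illustrative medical vocabulary

def _edit_distance(a: list[str], b: list[str]) -> int:
    # Levenshtein distance over pinyin syllables, single rolling row.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def pinyin_correct(text: str, max_dist: int = 1) -> str:
    """Replace substrings whose pinyin is within max_dist edits of a
    domain term's pinyin with that term."""
    for term in DOMAIN_TERMS:
        n = len(term)
        term_py = lazy_pinyin(term)
        for i in range(len(text) - n + 1):
            window = text[i : i + n]
            if _edit_distance(lazy_pinyin(window), term_py) <= max_dist:
                text = text[:i] + term + text[i + n :]
    return text

def post_process(text: str, llm_correct: Callable[[str], str]) -> str:
    # Pinyin snapping first, then an LLM pass for any remaining errors.
    return llm_correct(pinyin_correct(text))
```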
(V) Results:
Evaluation metrics are summarized in Table 1, showing that:
The fine-tuned model achieves a lower Character Error Rate (CER) compared to the official baseline model.
The STT post-processing with pinyin correction further reduces the CER compared to post-processing without pinyin correction.
(Table 1) Character Error Rate (CER): lower values indicate better performance.
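For reference, CER is the character-level edit distance between hypothesis and reference divided by the number of reference characters: CER = (S + D + I) / N, where S, D, and I count character substitutions, deletions, and insertions. A minimal, self-contained way to compute it:

```python
# Character Error Rate: edit distance between hypothesis and reference
# characters, divided by the reference length. Lower is better.
def cer(reference: str, hypothesis: str) -> float:
    dp = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hypothesis, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1] / max(len(reference), 1)

print(cer("心肌梗死病史", "心机梗死病史"))  # 1 substitution over 6 chars -> ~0.167
```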