Speech-to-Text: SenseVoice
posted in 2025
(I) Objective: To build a complete fine-tuning pipeline that adapts SenseVoice to any specific domain.
The pipeline includes:
Automatically generating training audio data
Verifying the correctness of generated audios
Converting data into the SenseVoice training and validation formats (a conversion sketch follows this list)
Fine-tuning the SenseVoice model
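For the conversion step, a small script is enough to turn the verified (audio, transcript) pairs into training and validation manifests. The sketch below writes a jsonl manifest in the layout used by FunASR-style fine-tuning recipes; the exact field names (key, source, source_len, target, target_len) and the length conventions are assumptions and should be checked against the FunASR version you use.

```python
# Minimal sketch: convert verified (wav, transcript) pairs into a jsonl
# manifest for FunASR-style fine-tuning. Field names and length conventions
# are assumptions; verify them against your FunASR/SenseVoice recipe.
import json
import wave
from pathlib import Path

def write_manifest(pairs: list[tuple[str, str]], out_path: Path) -> None:
    """pairs: list of (wav_path, transcript) that passed STT verification."""
    with out_path.open("w", encoding="utf-8") as f:
        for wav_path, text in pairs:
            with wave.open(wav_path, "rb") as w:
                # Clip duration in milliseconds, used here as a rough length field.
                duration_ms = int(w.getnframes() / w.getframerate() * 1000)
            record = {
                "key": Path(wav_path).stem,
                "source": wav_path,
                "source_len": duration_ms,
                "target": text,
                "target_len": len(text),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```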
(II) Collect text data: Collected medical documents and articles from both public websites and doctors associated with our clients.
(III) Generate data:
Utilized multiple Text-to-Speech (TTS) models (including Higgs, IndexTTS2, and Coqui), selecting the most accurate ones to enhance the diversity of the training dataset.
Generated speech data covered a wide range of speakers, accents, acoustic environments, background noises, emotions, spatial effects, and contextual variations (see Fig. 1).
To handle difficult word pronunciations in TTS models, I built a pinyin-based substitution dictionary that replaces hard-to-pronounce words with easier-to-pronounce equivalents (sketched after Fig. 1).
To verify the correctness of the generated audio, I used an ensemble of STT models (Whisper-Large-v3 and Qwen3-Omni) for cross-validation (see the filter sketch after Fig. 1).
(Fig. 1) The generated speech data covers a wide range of scenarios.
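The substitution dictionary maps a term the TTS models mispronounce to a stand-in that shares (nearly) the same pinyin, so the synthesized audio can still be paired with the original transcript. A minimal sketch, with purely illustrative dictionary entries:

```python
# Minimal sketch of a pinyin-based substitution pass applied to text before
# it is sent to a TTS model. Dictionary entries are illustrative only; the
# real mapping was curated per domain for words the TTS models mispronounced.
import re

# hard-to-pronounce term -> easier stand-in intended to share the same
# (or very close) pinyin, so the audio still matches the intended word.
SUBSTITUTIONS = {
    "癥瘕": "征假",   # illustrative entry, both read roughly "zheng jia"
    "瘰疬": "裸历",   # illustrative entry, both read roughly "luo li"
}

_PATTERN = re.compile(
    "|".join(map(re.escape, sorted(SUBSTITUTIONS, key=len, reverse=True)))
)

def substitute_for_tts(text: str) -> str:
    """Replace hard-to-pronounce words before synthesis; in this sketch the
    original text is assumed to be kept as the training transcript."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0)], text)
```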
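The cross-validation itself can be a simple filter: transcribe each generated clip with both models and keep it only when every transcript is close enough to the source text. In the sketch below the two recognizers are abstract callables (loading Whisper-Large-v3 and Qwen3-Omni is omitted), and the 0.9 similarity threshold is an illustrative value rather than the one used in the project.

```python
# Minimal sketch of the STT cross-validation filter. The recognizers are
# passed in as callables (wav path -> transcript); model loading is omitted.
from difflib import SequenceMatcher
from typing import Callable

def _similarity(a: str, b: str) -> float:
    # Character-level similarity; punctuation/whitespace normalization
    # would typically be applied first for Chinese text.
    return SequenceMatcher(None, a, b).ratio()

def is_audio_valid(
    wav_path: str,
    reference_text: str,
    recognizers: list[Callable[[str], str]],
    threshold: float = 0.9,
) -> bool:
    """Keep a generated clip only if every STT model's transcript is close
    enough to the text the clip was synthesized from."""
    return all(
        _similarity(stt(wav_path), reference_text) >= threshold
        for stt in recognizers
    )
```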
(IV) STT Post-Processing: For speech with unclear or inaccurate pronunciation, I applied pinyin-distance matching combined with LLM-based correction to refine the transcripts and improve accuracy.
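One way to implement the pinyin-distance step is to convert both the STT output and a list of known domain terms to pinyin, snap near-matches onto the correct term, and then hand the result to an LLM for a final pass. The sketch below uses pypinyin and a plain edit distance over pinyin syllables; the term list and distance threshold are illustrative, and the LLM correction is represented only by an abstract callable.

```python
# Minimal sketch of pinyin-distance correction: spans of the transcript whose
# pinyin is close to a known domain term are replaced by that term. Term list
# and threshold are illustrative; the LLM step is an abstract callable.
from typing import Callable
from pypinyin import lazy_pinyin

DOMAIN_TERMS = ["心肌梗死", "冠状动脉"]  # illustrative medical vocabulary

def _edit_distance(a: list[str], b: list[str]) -> int:
    # Levenshtein distance over pinyin syllables, single rolling row.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def pinyin_correct(text: str, max_dist: int = 1) -> str:
    """Replace substrings whose pinyin is within max_dist edits of a
    domain term's pinyin with that term."""
    for term in DOMAIN_TERMS:
        n = len(term)
        term_py = lazy_pinyin(term)
        for i in range(len(text) - n + 1):
            window = text[i : i + n]
            if _edit_distance(lazy_pinyin(window), term_py) <= max_dist:
                text = text[:i] + term + text[i + n :]
    return text

def post_process(text: str, llm_correct: Callable[[str], str]) -> str:
    # Pinyin snapping first, then an LLM pass for any remaining errors.
    return llm_correct(pinyin_correct(text))
```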
(V) Results:
Evaluation metrics are summarized in Table 1, showing that:
The fine-tuned model achieves a lower Character Error Rate (CER) compared to the official baseline model.
The STT post-processing with pinyin correction further reduces the CER compared to post-processing without pinyin correction.
(Table 1) Character Error Rate (CER): lower values indicate better performance.
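For reference, CER is the character-level edit distance between hypothesis and reference divided by the number of reference characters: CER = (S + D + I) / N, where S, D, and I count character substitutions, deletions, and insertions. A minimal, self-contained way to compute it:

```python
# Character Error Rate: edit distance between hypothesis and reference
# characters, divided by the reference length. Lower is better.
def cer(reference: str, hypothesis: str) -> float:
    dp = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hypothesis, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1] / max(len(reference), 1)

print(cer("心肌梗死病史", "心机梗死病史"))  # 1 substitution over 6 chars -> ~0.167
```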