Paper Link:
Overview
Automatic Speech Recognition (ASR) models, like Whisper, perform exceptionally well for high-resource languages under clean audio conditions, but their robustness to noise in low-resource settings remains uncertain.
In this project, we conduct the first systematic study of ASR noise robustness for Sundanese and Javanese, two of Indonesia’s largest regional languages, by evaluating Whisper models under diverse noise conditions and training strategies.
Methodology
Our pipeline combines SpecAugment and noise-aware training (NoiseTrain) to improve recognition in low signal-to-noise ratio (SNR) environments; illustrative sketches of the key steps follow the list:
Data: We fine-tune Whisper on the OpenSLR corpora (≈60 h of training data per language).
Noise Simulation: We mix background sounds from AudioSet (e.g., traffic, chatter, sirens) across 10 SNR levels.
Evaluation: Models are tested on both clean and noisy versions of the data using Word Error Rate (WER) and Character Error Rate (CER) as metrics.
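As a sketch of the noise-simulation step: the helper below (the name mix_at_snr is hypothetical, and the project's exact ten SNR levels are not reproduced here) scales an AudioSet noise clip so that the speech-to-noise power ratio matches a target SNR in dB before adding it to the utterance.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise clip into a speech clip at a target SNR (dB).

    Both inputs are float waveforms at the same sample rate; the noise
    is looped or truncated to cover the full utterance.
    """
    # Tile or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Scale noise so that 10 * log10(P_speech / P_noise) equals snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10  # guard against silent clips
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Illustrative sweep only; the study's actual SNR grid is not listed here.
# for snr in [-5, 0, 5, 10, 15]:
#     noisy = mix_at_snr(clean_wav, traffic_wav, snr)
```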
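SpecAugment itself masks random bands of the log-mel spectrogram during training. A minimal version using torchaudio's built-in transforms might look like the sketch below; the masking widths are illustrative, not the project's actual settings.

```python
import torch
import torchaudio.transforms as T

# Illustrative masking widths, not the settings used in this project.
freq_mask = T.FrequencyMasking(freq_mask_param=27)   # mask up to 27 mel bins
time_mask = T.TimeMasking(time_mask_param=100)       # mask up to 100 frames

def spec_augment(log_mel: torch.Tensor) -> torch.Tensor:
    """Apply one frequency mask and one time mask to a (..., mel, time) spectrogram."""
    return time_mask(freq_mask(log_mel))
```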
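For scoring, WER and CER can be computed with the jiwer library. The Sundanese sentence pair below is a made-up example in which a single missing diacritic counts as one word error but only one character error, which is why the two metrics are reported together.

```python
import jiwer

# Hypothetical example pair; in the study, references come from the OpenSLR
# transcripts and hypotheses from Whisper decoding of clean or noisy audio.
reference = "abdi badé angkat ka pasar"
hypothesis = "abdi bade angkat ka pasar"

print("WER:", jiwer.wer(reference, hypothesis))  # word edits / reference words
print("CER:", jiwer.cer(reference, hypothesis))  # char edits / reference chars
```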
Key Findings
Noise-aware fine-tuning improves robustness dramatically under low-SNR conditions, reducing WER by over 50% compared to clean-only training.
SpecAugment also improves stability across moderate noise but is less effective under extreme noise.
Javanese errors are dominated by consonant and diacritic mistakes, while Sundanese shows more vowel confusions.
Larger Whisper models (e.g., Large-v3) achieve the best trade-off between robustness and efficiency.
Insights & Impact
This work highlights that ASR models still struggle in multilingual, noisy environments, especially for agglutinative, underrepresented languages. By integrating noise-aware and augmentation-based training, we provide a reproducible framework for developing robust ASR systems for low-resource languages—an essential step toward inclusive speech technologies.
Future Directions
We plan to expand this work toward dialect-aware fine-tuning and integrate speech enhancement front-ends for real-world noise handling.