Cycle-Consistent GAN Front-end to Improve ASR Robustness to Perturbed Speech

Paper presented at the Interpretability and Robustness in Audio, Speech, and Language (IRASL) Workshop

Conference on Neural Information Processing Systems (NeurIPS/NIPS) 2018

  1. Abstract: Automatic Speech Recognition (ASR) systems, which perform well on regular speech, are vulnerable to adversarial examples generated by small perturbations in the audio signal. Even naturally introduced perturbations in the audio signal, caused by the emotional and physical states of the speaker, can significantly degrade ASR performance. We propose a front-end based on a Cycle-Consistent Generative Adversarial Network (CycleGAN) to reduce these perturbations and thereby make ASR more robust. The CycleGAN is trained using non-parallel examples of perturbed and normal speech. Experiments on spontaneously generated laughter-speech and creaky-voice datasets, tested with Google Cloud ASR, show absolute improvements in word error rate (WER) of 14.9% and 11%, respectively, for speech converted using the CycleGAN-based front-end compared to the original perturbed speech.
  2. Source: Jupyter notebook (download)
  3. Submitted Paper (download)
  4. Results at a Glance