Synthesizing Audio Adversarial Examples for Automatic Speech Recognition

(Accepted by KDD'22)

Adversarial examples in automatic speech recognition (ASR) sound natural to humans yet can fool well-trained ASR models into transcribing them incorrectly. Existing audio adversarial examples are typically constructed by adding constrained perturbations to benign audio inputs, so such attacks rest on an audio-dependent assumption. For the first time, we propose the speech synthesising based attack (SSA), a novel threat model that constructs audio adversarial examples entirely from scratch (i.e., without depending on any existing audio) to fool cutting-edge ASR models. To this end, we introduce a conditional variational auto-encoder (CVAE) as the speech synthesiser, and we formulate the adversarial audio synthesising task as an optimisation problem solved by searching in the hidden space of the CVAE. Experiments on three datasets (i.e., AudioMNIST, Common Voice, and LibriSpeech) show that our method can synthesise audio that sounds natural but misleads state-of-the-art ASR models. Source code will be available upon acceptance.
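The idea of searching the hidden space of a generator can be illustrated with a toy sketch. The code below is not the SSA implementation: the "decoder" and the "ASR model" are hypothetical random linear maps standing in for the CVAE synthesiser and a real recogniser, and every name, dimension, and hyperparameter in it is an assumption made for illustration. It only shows the shape of the optimisation: start from a latent sample, decode it to a waveform, and follow the gradient of a targeted loss back to the latent vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real CVAE and ASR model are far larger.
LATENT_DIM, AUDIO_DIM, NUM_CLASSES = 16, 64, 10

# Stand-ins (assumptions): fixed random linear maps, not trained networks.
W_dec = rng.normal(size=(AUDIO_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)
W_asr = rng.normal(size=(NUM_CLASSES, AUDIO_DIM)) / np.sqrt(AUDIO_DIM)


def decode(z):
    """Toy decoder: map a latent vector to a 'waveform' with samples in [-1, 1]."""
    return np.tanh(W_dec @ z)


def asr_logits(x):
    """Toy recogniser: one logit per class (e.g. one per spoken digit)."""
    return W_asr @ x


def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()


def search_latent(target, steps=500, lr=0.05, prior_weight=0.01):
    """Gradient descent on z for: cross-entropy(target) + prior_weight/2 * ||z||^2."""
    z = rng.normal(size=LATENT_DIM)          # start from a sample of the prior
    for _ in range(steps):
        x = decode(z)
        p = softmax(asr_logits(x))
        grad_logits = p.copy()
        grad_logits[target] -= 1.0           # d(cross-entropy)/d(logits)
        grad_x = W_asr.T @ grad_logits       # back through the toy recogniser
        grad_z = W_dec.T @ (grad_x * (1.0 - x ** 2))  # back through tanh and decoder
        grad_z += prior_weight * z           # keep z close to the prior
        z -= lr * grad_z
    return z


if __name__ == "__main__":
    z_adv = search_latent(target=3)
    print("predicted class:", int(np.argmax(asr_logits(decode(z_adv)))))
```

In this sketch nothing conditions the decoder on the intended spoken content; in the setting described above, that role belongs to the conditional input of the CVAE, which is omitted here for brevity.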

The audio play button may occasionally fail to work because of instability in Google Sites itself.

If that happens, please try clicking the arrow button to play. We apologise for the inconvenience.

Set 1

The audio below is synthesised to convey the content "Send a greeting email to Tom", while a well-trained ASR model recognises it as "Transfer one million dollars to Jery".

tom_jery.wav

Targeted Attack Transcription

Transfer one million dollars to Jery

Set 2

The audio below is synthesised to convey the content "They remain divine regardless of men's opinion", while a well-trained ASR model recognises it as "How came you to leave the key in the door".

cvae_attacked (1).wav

Targeted Attack Transcription

How came you to leave the key in the door

Set 3

Audio Semantic Content: Open the door please

OpenTheDoorPlease.wav

Targeted Attack Transcription

Close the window for me

Demos on AudioMNIST ("ONE" to ANY)

ASC: Audio Semantic Content, i.e., the ground-truth text conveyed by an audio signal.
TAT: Targeted Attack Transcription, i.e., the transcription the attacked ASR model produces.

one_zero.wav

ASC: ONE TAT: ZERO

one_two.wav

ASC: ONE TAT: TWO

one_three.wav

ASC: ONE TAT: THREE

one_four.wav

ASC: ONE TAT: FOUR

one_five.wav

ASC: ONE TAT: FIVE

one_six.wav

ASC: ONE TAT: SIX

one_seven.wav

ASC: ONE TAT: SEVEN

one_eight.wav

ASC: ONE TAT: EIGHT

one_nine.wav

ASC: ONE TAT: NINE

Demos on AudioMNIST (ANY to "ONE")

zero_one.wav

ASC: ZERO TAT: ONE

two_one.wav

ASC: TWO TAT: ONE

three_one.wav

ASC: THREE TAT: ONE

four_one.wav

ASC: FOUR TAT: ONE

five_one.wav

ASC: FIVE TAT: ONE

six_one.wav

ASC: SIX TAT: ONE

seven_one.wav

ASC: SEVEN TAT: ONE

eight_one.wav

ASC: EIGHT TAT: ONE

nine_one.wav

ASC: NINE TAT: ONE

Waveform Pattern Analysis

Figures (a) and (b) depict, respectively, the original audio and the corresponding adversarially perturbed audio produced by a previous audio-dependent attack (the C&W attack). The attacked audio is visibly restricted to adding only minor perturbations to the original. In contrast, the adversarial audio constructed by our SSA, shown in Figure (c), is free of this restriction: its waveform can differ significantly.

extra2a.wav

extra2b.wav

extra2c.wav
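One rough way to put numbers on the contrast described above is to compare waveforms by their maximum sample difference and their signal-to-noise ratio. The sketch below uses made-up sine waves rather than the demo files, and the helper names `linf_distance` and `snr_db` are hypothetical; it is only meant to show how a bounded perturbation and an unconstrained waveform differ under these two measures.

```python
import numpy as np


def linf_distance(reference, candidate):
    """Largest absolute per-sample difference between two equal-length waveforms."""
    return float(np.max(np.abs(candidate - reference)))


def snr_db(reference, candidate):
    """Signal-to-noise ratio in dB, treating (candidate - reference) as the noise."""
    noise = candidate - reference
    return float(10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2)))


# Synthetic stand-ins (assumptions): one second of audio at 16 kHz.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
benign = 0.5 * np.sin(2 * np.pi * 220 * t)

# Perturbation-style example: the benign signal plus a small bounded change.
perturbed = benign + 0.005 * np.sin(2 * np.pi * 3000 * t)

# Unconstrained example: a waveform that is not tied to the benign signal at all.
unconstrained = 0.5 * np.sin(2 * np.pi * 330 * t + 1.0)

if __name__ == "__main__":
    for name, wave in [("perturbed", perturbed), ("unconstrained", unconstrained)]:
        print(name, linf_distance(benign, wave), snr_db(benign, wave))
```

A perturbation-based example stays within a small distance of its source and so has a high SNR, whereas a waveform generated without reference to the source can sit arbitrarily far from it.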
