Introduction
The LM can overcorrect the correct result of the AM
Audio Examples
① One-segment untargeted attack
② Targeted attack
③ Wrongly waking up a commercial voice assistant over the air
Existing adversarial example (AE) attacks against Automatic Speech Recognition (ASR) systems focus on adding deliberate noise to an input audio clip. In this paper, we propose a new attack that purely speeds up or slows down an original audio clip instead of adding perturbations, and we call it the Time-Scale Modification Adversarial Example (TSMAE). By investigating the impact of speed variation on 100,000 audio clips, we found that misrecognitions fall into three categories (deletion, substitution, and insertion) and are the accumulated result of errors in both the acoustic model (AM) and the language model (LM) inside an ASR system. Despite the challenge that ASR systems are typically black boxes revealing no gradient information, we manage to launch untargeted and targeted TSMAE attacks based on particle swarm optimization. Our untargeted attacks only require modifying the speed of one segment (e.g., 20 ms), and our targeted attacks can make an ASR system transcribe a meaningful yet benign sentence as a malicious output, e.g., “open the door”. We validate the feasibility of TSMAE on 2 open-source ASR models (DeepSpeech and Sphinx) and 4 commercial ones (IBM, Google, Baidu, and iFLYTEK). Results show that our untargeted attack is query-efficient, achieving a 100% success rate within 50 ASR queries on DeepSpeech, while our targeted attack is robust to various factors, such as model versions and speech sources. Finally, both attacks can bypass existing open-source defense methods, and our insights call for shifting the focus of defense from coping with added perturbations to such emerging AE attacks.
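To make the three error categories concrete, the following minimal sketch (illustrative only, not part of our code base) aligns a reference transcription with an ASR hypothesis at the word level via a standard edit-distance backtrace and counts deletions, substitutions, and insertions.

```python
# Minimal sketch (illustrative only): classify word-level ASR errors into
# deletions, substitutions, and insertions via edit-distance alignment.
from collections import Counter


def error_counts(reference: str, hypothesis: str) -> Counter:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = i
    for j in range(1, len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Backtrack through the table and label each edit operation.
    counts, i, j = Counter(), len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # match, no error
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            counts["substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


print(error_counts("can i help you", "cannot help you"))
# -> one substitution and one deletion (the exact labels can differ among
#    equal-cost alignments)
```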
To find out the roles of the AM and the LM in accounting for the misrecognized results, we select 1,000 audio clips of the command “take a picture” and keep the speed sequence applied to each audio clip unchanged. We evaluate the impact of the two models by comparing the ideal output (A0), the metadata (A1, i.e., the output of the AM), and the output of the LM (A2). The figure shows an example of how the language model overcorrects a correct AM result, which is one of five cases regarding the text values of A0, A1, and A2. More concrete examples of this case can be viewed via the link.
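One rough way to reproduce such an A0/A1/A2 comparison is to decode the same clip with DeepSpeech 0.7.1 twice: once without the external language-model scorer (an approximation of A1) and once with it enabled (A2). The sketch below assumes a 16 kHz, 16-bit mono WAV file; the model, scorer, and audio file names are placeholders, and scorer-disabled decoding only approximates the metadata we use as A1.

```python
# Sketch: compare DeepSpeech 0.7.1 decoding without the external scorer
# (an approximation of A1, the AM output) against decoding with the scorer
# enabled (A2). File names below are placeholders.
import wave

import numpy as np
from deepspeech import Model

# Load a 16 kHz, 16-bit mono WAV clip of "take a picture" (placeholder file).
with wave.open("take_a_picture.wav", "rb") as fin:
    audio = np.frombuffer(fin.readframes(fin.getnframes()), dtype=np.int16)

ds = Model("deepspeech-0.7.1-models.pbmm")                  # placeholder path

a1 = ds.stt(audio)                                          # no scorer: AM only
ds.enableExternalScorer("deepspeech-0.7.1-models.scorer")   # placeholder path
a2 = ds.stt(audio)                                          # AM + LM rescoring

a0 = "take a picture"                                       # ideal output
print(f"A0: {a0}\nA1 (AM only): {a1}\nA2 (AM + LM): {a2}")
```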
Below are some of our audio adversarial examples, covering the one-segment untargeted attack, the targeted attack, and wrongly waking up a commercial voice assistant over the air.
We aim to cause an ASR system (DeepSpeech 0.7.1) to fail to correctly recognize an audio clip with minimum effort, i.e., by changing the speed of only one small segment of the original audio; a minimal sketch of this one-segment modification follows the examples below.
Original Audio
Restart phone now
Adversarial Examples
Return
Original Audio
Can I help you
Adversarial Examples
Cannot help you
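The sketch below illustrates the kind of one-segment modification used above. It assumes a 16 kHz mono input, an illustrative segment position and speed factor, and uses plain linear-interpolation resampling, which (unlike a dedicated time-scale-modification routine) also shifts pitch within the segment.

```python
# Sketch: speed up a single 20 ms segment of a 16 kHz mono waveform and
# leave the rest untouched. Segment position, speed factor, and file names
# are illustrative.
import numpy as np
import soundfile as sf

SR = 16000       # sampling rate (Hz)
SEG_MS = 20      # segment length (ms)


def change_segment_speed(audio: np.ndarray, start_ms: int, rate: float) -> np.ndarray:
    """Resample the 20 ms segment starting at start_ms by `rate` (>1 = faster)."""
    start = start_ms * SR // 1000
    end = start + SEG_MS * SR // 1000
    seg = audio[start:end]
    new_len = max(1, int(round(len(seg) / rate)))
    # Interpolate the segment onto a shorter (or longer) time grid.
    resampled = np.interp(np.linspace(0, len(seg) - 1, new_len),
                          np.arange(len(seg)), seg)
    return np.concatenate([audio[:start], resampled, audio[end:]])


audio, sr = sf.read("restart_phone_now.wav")                # placeholder file
assert sr == SR and audio.ndim == 1
adv = change_segment_speed(audio, start_ms=350, rate=1.5)   # illustrative values
sf.write("restart_phone_now_tsmae.wav", adv, SR)
```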
We aim to mislead an ASR system (DeepSpeech 0.7.1) into outputting target transcriptions that are commonly used in security-relevant scenarios, e.g., “open the door” in a smart home; a sketch of the black-box search behind this attack follows the examples below.
Original Audio
An old pen of the doctor
Adversarial Examples
Open the door
Original Audio
These durn red lights
Adversarial Examples
Turn right
Original Audio
Next order
Adversarial Examples
No
Original Audio
You eat salad
Adversarial Examples
Yes
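The sketch below outlines a particle-swarm search for a per-segment speed sequence that drives a black-box ASR toward a target transcription. The callables apply_speeds and transcribe, as well as all hyperparameters, are placeholders to be supplied by the reader (transcribe can, for instance, be built from the DeepSpeech snippet above); the optimization details in the paper may differ.

```python
# Sketch: particle swarm search for a speed sequence that makes a black-box
# ASR output a target transcription. `apply_speeds(audio, speeds)` must return
# the waveform with the i-th segment time-scaled by speeds[i], and
# `transcribe(waveform)` must return the ASR transcription; both are supplied
# by the caller. Hyperparameters are illustrative, not the paper's.
import difflib

import numpy as np


def pso_targeted(audio, target, apply_speeds, transcribe,
                 n_segments=20, n_particles=16, n_iters=50,
                 speed_lo=0.5, speed_hi=2.0, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)

    def loss(speeds):
        hyp = transcribe(apply_speeds(audio, speeds))
        return 1.0 - difflib.SequenceMatcher(None, hyp, target).ratio()  # 0 = exact hit

    # Particles are speed sequences, one factor per segment.
    pos = rng.uniform(speed_lo, speed_hi, size=(n_particles, n_segments))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_loss = np.array([loss(p) for p in pos])
    gbest = pbest[np.argmin(pbest_loss)].copy()
    gbest_loss = pbest_loss.min()

    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, speed_lo, speed_hi)
        for i, p in enumerate(pos):
            cur = loss(p)                       # one black-box ASR query
            if cur < pbest_loss[i]:
                pbest[i], pbest_loss[i] = p.copy(), cur
                if cur < gbest_loss:
                    gbest, gbest_loss = p.copy(), cur
        if gbest_loss == 0.0:                   # target transcription reached
            break
    return gbest, gbest_loss
```

Each call to loss issues one ASR query, so the total query budget is roughly n_particles × (n_iters + 1).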
Here we validate the feasibility of TSMAE against a commercial voice assistant (i.e., intentionally activating an Amazon Echo) in the over-the-air scenario. It is worth noting that the maximum attack distance of the adversarial example “an elector” can exceed 7 m even when it is played by only a small loudspeaker.