Introduction
The LM can overcorrect the correct result of the AM
Audio Examples
① One-segment untargeted attack
② Targeted attack
③ Wrongly waking up a commercial voice assistant over the air
Existing adversarial example (AE) attacks against Automatic Speech Recognition (ASR) systems focus on adding deliberate noise to an input audio clip. In this paper, we propose a new attack that purely speeds up or slows down an original audio clip instead of adding perturbations, and we call it the Time-Scale Modification Adversarial Example (TSMAE). By investigating the impact of speed variation on 100,000 audio clips, we found that misrecognitions fall into three categories (deletion, substitution, and insertion) and are the accumulated result of errors in both the acoustic model (AM) and the language model (LM) inside an ASR system. Despite the challenge that ASR systems are typically black boxes revealing no gradient information, we manage to launch untargeted and targeted TSMAE attacks based on particle swarm optimization. Our untargeted attacks only require modifying the speed of one segment (e.g., 20 ms), and our targeted attacks can make an ASR system transcribe a meaningful yet benign sentence as a malicious output, e.g., “open the door”. We validate the feasibility of TSMAE on 2 open-source ASR models (DeepSpeech and Sphinx) and 4 commercial ones (IBM, Google, Baidu, and iFLYTEK). Results show that our untargeted attack is query-efficient, achieving a 100% success rate within 50 ASR queries on DeepSpeech, while our targeted attack is robust to various factors, such as model versions and speech sources. Finally, both attacks can bypass existing open-source defense methods, and our insights call for shifting the focus of defense from coping with added perturbations to such emerging AE attacks.
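To make the three error categories concrete, the following minimal sketch (illustrative only, not part of our code base) aligns a reference transcription with an ASR hypothesis at the word level via a standard edit-distance backtrace and counts deletions, substitutions, and insertions.

```python
# Minimal sketch (illustrative only): classify word-level ASR errors into
# deletions, substitutions, and insertions via edit-distance alignment.
from collections import Counter


def error_counts(reference: str, hypothesis: str) -> Counter:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = i
    for j in range(1, len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Backtrack through the table and label each edit operation.
    counts, i, j = Counter(), len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # match, no error
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            counts["substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


print(error_counts("can i help you", "cannot help you"))
# -> one substitution and one deletion (the exact labels can differ among
#    equal-cost alignments)
```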
To find out the roles of the AM and the LM in accounting for the misrecognized results, we select 1,000 audio clips of the command “take a picture” and keep the speed sequence applied to each audio clip unchanged. We evaluate the impact of the two models by comparing the ideal output (A0), the metadata (A1, i.e., the output of the AM), and the output of the LM (A2). The figure shows an example of how the language model overcorrects a correct AM result, which is one of five cases regarding the text values of A0, A1, and A2. More concrete examples of this case can be viewed via the link.
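One rough way to reproduce such an A0/A1/A2 comparison is to decode the same clip with DeepSpeech 0.7.1 twice: once without the external language-model scorer (an approximation of A1) and once with it enabled (A2). The sketch below assumes a 16 kHz, 16-bit mono WAV file; the model, scorer, and audio file names are placeholders, and scorer-disabled decoding only approximates the metadata we use as A1.

```python
# Sketch: compare DeepSpeech 0.7.1 decoding without the external scorer
# (an approximation of A1, the AM output) against decoding with the scorer
# enabled (A2). File names below are placeholders.
import wave

import numpy as np
from deepspeech import Model

# Load a 16 kHz, 16-bit mono WAV clip of "take a picture" (placeholder file).
with wave.open("take_a_picture.wav", "rb") as fin:
    audio = np.frombuffer(fin.readframes(fin.getnframes()), dtype=np.int16)

ds = Model("deepspeech-0.7.1-models.pbmm")                  # placeholder path

a1 = ds.stt(audio)                                          # no scorer: AM only
ds.enableExternalScorer("deepspeech-0.7.1-models.scorer")   # placeholder path
a2 = ds.stt(audio)                                          # AM + LM rescoring

a0 = "take a picture"                                       # ideal output
print(f"A0: {a0}\nA1 (AM only): {a1}\nA2 (AM + LM): {a2}")
```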
Below are some of our audio adversarial examples, covering the one-segment untargeted attack, the targeted attack, and wrongly waking up a commercial voice assistant over the air.
We aim to cause an ASR system (DeepSpeech 0.7.1) to fail to correctly recognize an audio clip with minimum effort, i.e., by changing the speed of only one small segment of the original audio; a minimal sketch of this one-segment modification follows the examples below.
Original Audio
Restart phone now
Adversarial Examples
Return
Original Audio
Can I help you
Adversarial Examples
Cannot help you
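The sketch below illustrates the kind of one-segment modification used above. It assumes a 16 kHz mono input, an illustrative segment position and speed factor, and uses plain linear-interpolation resampling, which (unlike a dedicated time-scale-modification routine) also shifts pitch within the segment.

```python
# Sketch: speed up a single 20 ms segment of a 16 kHz mono waveform and
# leave the rest untouched. Segment position, speed factor, and file names
# are illustrative.
import numpy as np
import soundfile as sf

SR = 16000       # sampling rate (Hz)
SEG_MS = 20      # segment length (ms)


def change_segment_speed(audio: np.ndarray, start_ms: int, rate: float) -> np.ndarray:
    """Resample the 20 ms segment starting at start_ms by `rate` (>1 = faster)."""
    start = start_ms * SR // 1000
    end = start + SEG_MS * SR // 1000
    seg = audio[start:end]
    new_len = max(1, int(round(len(seg) / rate)))
    # Interpolate the segment onto a shorter (or longer) time grid.
    resampled = np.interp(np.linspace(0, len(seg) - 1, new_len),
                          np.arange(len(seg)), seg)
    return np.concatenate([audio[:start], resampled, audio[end:]])


audio, sr = sf.read("restart_phone_now.wav")                # placeholder file
assert sr == SR and audio.ndim == 1
adv = change_segment_speed(audio, start_ms=350, rate=1.5)   # illustrative values
sf.write("restart_phone_now_tsmae.wav", adv, SR)
```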
We aim to mislead an ASR system (DeepSpeech 0.7.1) into outputting target transcriptions that are commonly used in security-relevant scenarios, e.g., “open the door” in a smart home; a sketch of the black-box search behind this attack follows the examples below.
Original Audio
An old pen of the doctor
Adversarial Examples
Open the door
Original Audio
These durn red lights
Adversarial Examples
Turn right
Original Audio
Next order
Adversarial Examples
No
Original Audio
You eat salad
Adversarial Examples
Yes
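The sketch below outlines a particle-swarm search for a per-segment speed sequence that drives a black-box ASR toward a target transcription. The callables apply_speeds and transcribe, as well as all hyperparameters, are placeholders to be supplied by the reader (transcribe can, for instance, be built from the DeepSpeech snippet above); the optimization details in the paper may differ.

```python
# Sketch: particle swarm search for a speed sequence that makes a black-box
# ASR output a target transcription. `apply_speeds(audio, speeds)` must return
# the waveform with the i-th segment time-scaled by speeds[i], and
# `transcribe(waveform)` must return the ASR transcription; both are supplied
# by the caller. Hyperparameters are illustrative, not the paper's.
import difflib

import numpy as np


def pso_targeted(audio, target, apply_speeds, transcribe,
                 n_segments=20, n_particles=16, n_iters=50,
                 speed_lo=0.5, speed_hi=2.0, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)

    def loss(speeds):
        hyp = transcribe(apply_speeds(audio, speeds))
        return 1.0 - difflib.SequenceMatcher(None, hyp, target).ratio()  # 0 = exact hit

    # Particles are speed sequences, one factor per segment.
    pos = rng.uniform(speed_lo, speed_hi, size=(n_particles, n_segments))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_loss = np.array([loss(p) for p in pos])
    gbest = pbest[np.argmin(pbest_loss)].copy()
    gbest_loss = pbest_loss.min()

    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, speed_lo, speed_hi)
        for i, p in enumerate(pos):
            cur = loss(p)                       # one black-box ASR query
            if cur < pbest_loss[i]:
                pbest[i], pbest_loss[i] = p.copy(), cur
                if cur < gbest_loss:
                    gbest, gbest_loss = p.copy(), cur
        if gbest_loss == 0.0:                   # target transcription reached
            break
    return gbest, gbest_loss
```

Each call to loss issues one ASR query, so the total query budget is roughly n_particles × (n_iters + 1).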
Here we validate the feasibility of TSMAE against a commercial voice assistant (i.e., intentionally activating an Amazon Echo) in the over-the-air scenario. It is worth noting that the maximum attack distance of the adversarial example “an elector” can exceed 7 m even when it is played by only a small loudspeaker.