We present Devil’s Whisper, a general adversarial attack on commercial black-box ASR (Automatic Speech Recognition) systems and IVC (Intelligent Voice Control) devices. Our adversarial examples (AEs) are stealthy enough to be perceived by humans as normal music. The attack remains effective against multiple ASR API services (e.g., Google Speech to Text, IBM Speech to Text, and Microsoft Bing Speech service) and IVC devices (Google Home, Google Assistant, Apple Siri, Amazon Echo, and Microsoft Cortana).
IMPORTANT: due to the psychoacoustic effect, the adversarial audio samples in our demo become easier to understand once the hidden commands have been revealed. To test human perception of our samples fairly, we suggest that readers play them to people who are unaware of our research, without revealing the content.
In the first two videos, we ask Amazon Echo to play random music and turn off the light, respectively.
In the two videos, we ask Google Assistant on a Google Pixel phone to set an alarm and play random music, respectively.
In the two videos, we ask the Google Home Mini to turn off the light and take photos, respectively (although it cannot take photos yet).
(The volume of the videos in this section is a little low, so you may want to turn up your speaker volume. We did not use any software to adjust the volume, since that would introduce noise into our demos.)
In the two videos, we ask Apple Siri on an iPhone 6s to navigate to a place and show the local weather, respectively.
(Note: we wake Siri up with a human voice, since the wake-up words were not trained; see our paper for details.)
In the two videos, we ask Microsoft Cortana to show its functionality and the local weather, respectively.
In this section, we upload several demos to demonstrate that our Devil’s Whisper attack is effective at realistic distances and volumes.
Specifically, we use a digital sound level meter, the “SMART SENSOR AS824”, to measure the volume. The background noise is about 50 dB and the played audio is about 70 dB. For comparison, typical reference sound levels include talking at 3 feet (65 dB) and living room music (76 dB).
We show three demos in total. The first demo shows that we can command Echo to turn off the light from up to 2 meters away. The second and third demos show that we can command Microsoft Cortana on a cell phone from up to 50 centimeters away.
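To put the sound levels quoted above in perspective, the short sketch below converts decibel differences into linear ratios. It assumes the meter reports standard sound pressure level, so a difference of 20 dB corresponds to a 10x ratio in sound pressure and a 100x ratio in acoustic power; the function and variable names are illustrative and do not come from our experiments.

```python
def pressure_ratio(db_diff: float) -> float:
    """Sound pressure (amplitude) ratio for a level difference in dB."""
    return 10 ** (db_diff / 20)


def power_ratio(db_diff: float) -> float:
    """Acoustic power ratio for a level difference in dB."""
    return 10 ** (db_diff / 10)


# Levels reported in the text (dB). Treating them as dB SPL is an assumption.
background_db = 50.0
playback_db = 70.0
references_db = {"talking at 3 feet": 65.0, "living room music": 76.0}

# Margin of the played audio over the background noise.
margin_db = playback_db - background_db
print(f"Margin over background: {margin_db:.0f} dB "
      f"(pressure x{pressure_ratio(margin_db):.1f}, "
      f"power x{power_ratio(margin_db):.0f})")

# How the played audio compares with the everyday reference levels.
for name, level_db in references_db.items():
    diff = playback_db - level_db
    print(f"Played audio vs. {name}: {diff:+.0f} dB")
```

Under these assumptions, the played audio sits 20 dB above the background noise, slightly louder than conversation at 3 feet and quieter than living room music.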