CommanderSong-Demo

Introduction

Driven by advances in deep learning, automatic speech recognition (ASR) systems have become widely deployed. While deep learning models in computer vision are known to be vulnerable to adversarial perturbations, little is known about whether such perturbations remain effective against practical speech recognition systems. CommanderSong embeds voice commands into a song so that they evade human detection. Using an open-source toolkit (Kaldi), we succeed in crafting arbitrary songs into "carriers" for any command in the wav-to-API attack. In addition, our wav-air-API attack, which plays the CommanderSongs over the air and decodes the recorded audio, achieves a 96% success rate. A song carrying a command can therefore spread through radio, TV, or any media player installed on portable devices such as smartphones, potentially impacting millions of users over long distances.

(1). Audio adversarial samples for wav-to-API attack

In this scenario, we assume the attacker can feed the adversarial audio directly into the speech recognition system.
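At a high level, samples like these are crafted by gradient descent: starting from the song, a small perturbation is optimized so that the recognizer's output moves toward the target command while the change stays bounded. The sketch below is a toy illustration only, using a random linear "recognizer" and a single class label as stand-ins for Kaldi's acoustic model and the target transcript (all names and parameters here are hypothetical, not the actual attack code):

```python
import numpy as np

# Toy sketch of gradient-based adversarial audio crafting.
# A real attack backpropagates through the ASR acoustic model; here a
# fixed random linear "recognizer" W stands in for it.
rng = np.random.default_rng(0)
n_samples, n_classes = 64, 5
W = rng.normal(size=(n_classes, n_samples))   # stand-in "acoustic model"

song = rng.normal(size=n_samples)             # the carrier audio
target = 3                                    # stand-in for the target command

delta = np.zeros(n_samples)                   # perturbation to optimize
for _ in range(500):
    logits = W @ (song + delta)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # cross-entropy gradient w.r.t. the audio, pulling toward `target`
    grad = W.T @ (probs - np.eye(n_classes)[target])
    delta -= 0.2 * grad                       # gradient descent step
    delta = np.clip(delta, -0.5, 0.5)         # keep the perturbation bounded

adv = song + delta
print(int(np.argmax(W @ adv)))                # class the "recognizer" now outputs
```

After the loop, the toy recognizer labels the perturbed audio as the target class even though the perturbation is clipped to a fraction of the signal's amplitude, which is the same trade-off (recognition vs. audibility) the real attack balances.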

First Set:

Target command: okay google call one one zero one one nine one two zero

Song name: To my sky

Decoding results:

Original audio: uh about er and me i guess uh canada oh god in [noise] read

Adversarial audio: okay google call one one zero one one nine one two zero

Second Set:

Target command: okay google turn on GPS

Song name: Love story

Decoding results:

Original audio: oh jeez you warm Missouri's name as yeah after you were back then run with me

Adversarial audio: okay google turn on g. p. s

Third set:

Target command: echo capital one to make a credit card payment plan

Song name: Good time

Decoding results:

Original audio: well look ah call neuro rhine guide owned the game

Adversarial audio: Echo ask capital one to make a credit card payment plan

(2). Audio adversarial samples for wav-air-API attack (Target Kaldi)

In this scenario, we assume the attacker can only play the audio over the air toward the speech recognition system. Note that the crafted song sounds noisier in this setting, since the perturbation must survive environmental effects in the physical world.

Target command: okay google call one one zero one one nine one two zero

Song name: To my sky

Decoding results:

Original audio: uh about er and me i guess uh canada oh god in [noise] read

Adversarial audio: 17 dB distortion (10 trials, success rate: almost 100%)

Decoding results:

  1. okay google call one one zero one one nine one two zero
  2. okay google call one one zero one one nine one two zero
  3. okay google call one one zero one one nine one two zero
  4. okay google call one one zero one one nine one two zero
  5. okay google call one one zero one one nine one two zero
  6. okay google call one one zero one one nine one two zero
  7. okay google call one one zero one one nine one two zero
  8. okay google call one one zero one one nine one two zero
  9. decay google call one one zero one one nine one two zero
  10. okay google called cupid one one zero one when nine one two zero
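The distortion figure above can be expressed as a signal-to-noise ratio, treating the added perturbation as "noise" relative to the original song. A minimal sketch with synthetic waveforms (the sine "song" and noise level are illustrative assumptions, not our actual samples):

```python
import numpy as np

def snr_db(original, adversarial):
    """Signal-to-noise ratio in dB: original song power over perturbation power."""
    original = np.asarray(original, dtype=np.float64)
    noise = np.asarray(adversarial, dtype=np.float64) - original
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))

# Synthetic example: one second of a 440 Hz tone at 16 kHz, plus a
# small random perturbation standing in for the embedded command.
t = np.linspace(0, 1, 16000, endpoint=False)
song = np.sin(2 * np.pi * 440 * t)
perturbed = song + 0.05 * np.random.default_rng(1).normal(size=t.size)
print(round(snr_db(song, perturbed), 1))      # SNR of this synthetic pair, in dB
```

A lower SNR means a louder, more audible perturbation; the 17 dB figure reflects the extra distortion needed to survive over-the-air playback.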

(3). Audio adversarial samples for cell phones (Target iFlytek Input App)

In this video, we show that our adversarial samples from the wav-air-API attack can successfully compromise the iFlytek Speech Input App. (Note: the phone and app UI language is not English, but the app can recognize English.)

(4). Automated spreading of audio adversarial samples

Since our wav-air-API (WAA) attack samples can be used to launch practical adversarial attacks against ASR systems, we explore channels that could be leveraged to impact a large number of victims automatically.

Here we show two demos. In the first, we upload our samples to YouTube and play the shared video toward a cell phone. In the second, we broadcast a forged radio signal carrying our samples to a radio receiver, which then plays them. Both demos show that we can successfully compromise the iFlytek Speech Input App on the phone.