CommanderSong Demo

    Driven by advances in deep learning, automatic speech recognition (ASR) systems have become popular in recent years. Although deep learning models in computer vision are known to be vulnerable to adversarial perturbations, little is known about whether such perturbations remain effective against practical speech recognition. We embed voice commands into a song, which we call a CommanderSong, so that the commands escape human detection. Targeting an open-source toolkit (Kaldi), we succeed in turning randomly chosen songs into "carriers" for any command in the wav-to-API attack. In addition, our wav-air-API attack, in which a CommanderSong is played over the air and the recorded audio is decoded, achieves a 94% success rate. A song carrying a command can therefore spread through radio, TV, or any media player installed on portable devices such as smartphones, potentially affecting millions of users over long distances.

(1) Audio adversarial examples for wav-to-API attack (Target Kaldi)

First set: Target command: okay google call one one zero one one nine one two zero

Song name: To my sky

Decoding results:

Original audio: uh about er and me i guess uh canada oh god in [noise] read

Adversarial audio: okay google call one one zero one one nine one two zero


Second set: Target command: okay google turn on GPS

Song name: Love story

Decoding results:

Original audio: oh jeez you warm Missouri's name as yeah after you were back then run with me

Adversarial audio: okay google turn on g. p. s

Third set: Target command: echo ask capital one to make a credit card payment plan

Song name: Good time

Decoding results:

Original audio: well look ah call neuro rhine guide owned the game

Adversarial audio: Echo ask capital one to make a credit card payment plan

(2) Audio adversarial examples for wav-air-API attack (Target Kaldi)

Target command: okay google call one one zero one one nine one two zero

Song name: To my sky

Decoding results:

Original audio: uh about er and me i guess uh canada oh god in [noise] read

Adversarial audio: 17 dB distortion (10 test runs, success rate: almost 100%)
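
The page does not say how the 17 dB figure is computed. Assuming it is a signal-to-noise ratio, with the original song as the signal and the added perturbation as the noise, a minimal sketch of that calculation could look like the following. The function name `snr_db` and the toy sine-wave signals are hypothetical and stand in for real audio samples.

```python
import math


def snr_db(original, adversarial):
    """SNR in dB, treating (adversarial - original) as the noise.

    Assumption: this is one common definition, 10 * log10(P_signal / P_noise);
    the demo page itself does not state which definition it uses.
    """
    if len(original) != len(adversarial):
        raise ValueError("signals must have the same length")
    n = len(original)
    signal_power = sum(x * x for x in original) / n
    noise_power = sum((a - x) ** 2 for x, a in zip(original, adversarial)) / n
    if noise_power == 0:
        return math.inf  # identical signals: no perturbation at all
    return 10 * math.log10(signal_power / noise_power)


# Toy stand-ins for one second of 16 kHz audio (not real song data).
rate = 16000
song = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
delta = [0.1 * math.sin(2 * math.pi * 3000 * n / rate) for n in range(rate)]
adversarial = [s + d for s, d in zip(song, delta)]

print(round(snr_db(song, adversarial), 1))
```

In this toy case the perturbation has one tenth of the song's amplitude, so the ratio comes out at about 20 dB. Under this definition a lower value means a louder perturbation relative to the song.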

Decoding results: 

  1. okay google call one one zero one one nine one two zero
  2. okay google call one one zero one one nine one two zero
  3. okay google call one one zero one one nine one two zero
  4. okay google call one one zero one one nine one two zero
  5. okay google call one one zero one one nine one two zero
  6. okay google call one one zero one one nine one two zero
  7. okay google call one one zero one one nine one two zero
  8. okay google call one one zero one one nine one two zero
  9. decay google call one one zero one one nine one two zero
  10. okay google called cupid one one zero one when nine one two zero
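
The ten decodings above can be scored against the target command in more than one way. The sketch below shows two simple options, an exact-match count and a word-level edit distance. The page does not say which metric lies behind its stated success rate, so both are offered as assumptions, and the names `TARGET`, `DECODED`, and `word_edit_distance` are hypothetical. The transcripts are copied from the list above.

```python
TARGET = "okay google call one one zero one one nine one two zero"

# Runs 1-8 match the target exactly; runs 9 and 10 are copied from the list.
DECODED = [TARGET] * 8 + [
    "decay google call one one zero one one nine one two zero",
    "okay google called cupid one one zero one when nine one two zero",
]


def word_edit_distance(reference, hypothesis):
    """Levenshtein distance counted in words (substitute, insert, delete)."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1]


exact_matches = sum(decoded == TARGET for decoded in DECODED)
edits = [word_edit_distance(TARGET, decoded) for decoded in DECODED]
word_error_rate = sum(edits) / (len(TARGET.split()) * len(DECODED))

print(f"exact matches: {exact_matches}/{len(DECODED)}")
print(f"word edits per run: {edits}")
print(f"overall word error rate: {word_error_rate:.3f}")
```

On these transcripts, eight of the ten runs match the command exactly. Run 9 differs by one word and run 10 by three, which gives 4 word edits over 120 reference words.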

(3) Audio adversarial examples for phones (Target iFlytek Input)

Attack "iFlytek Input"