With the development of machine learning techniques, speaker recognition (SR) has been widely deployed in daily life as a convenient and secure way to verify identity. Recently, these machine learning techniques have been shown to be vulnerable to adversarial samples, which are crafted by adding imperceptible perturbations to benign samples. Although successful adversarial attacks on image recognition systems have been ported to speech recognition systems, little is known about the security implications for SR systems (SRSs) under such attacks. Hence, we propose FAKEBOB, a systematic and practical attack on SRSs. Specifically, we consider API and over-the-air attacks on three different recognition tasks, as follows.
Below is a one-minute video preview of our work.
In this scenario, the adversary is able to feed adversarial voices directly into speaker recognition systems via the systems' API.
An open-set identification (OSI) system allows multiple speakers to be enrolled during the enrollment phase, forming a speaker group G. Given an arbitrary input voice, the system determines which speaker in G uttered it, or that none of them did (called reject).
The five speakers in G and their corresponding enrollment voices are shown below.
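The decision rule described above can be sketched as follows. This is a minimal illustration, assuming the system produces one similarity score per enrolled speaker and rejects the input when even the best score falls below a preset threshold. The names `osi_decide`, `scores`, and `threshold`, as well as the score values, are hypothetical and not taken from FAKEBOB's code.

```python
from typing import Dict

REJECT = "reject"


def osi_decide(scores: Dict[str, float], threshold: float) -> str:
    """Return the best-matching enrolled speaker, or reject.

    Assumption: a higher score means a closer match, and the input is
    rejected when the highest score is below the threshold.
    """
    best_speaker = max(scores, key=scores.get)
    if scores[best_speaker] < threshold:
        return REJECT
    return best_speaker


# Hypothetical scores for the five enrolled speakers in G.
example_scores = {"Alice": 0.2, "Mike": 0.9, "Sarah": 0.1, "Amy": 0.3, "Bob": 1.4}
print(osi_decide(example_scores, threshold=1.0))  # Bob
print(osi_decide(example_scores, threshold=2.0))  # reject
```

Under this view, an attack on an OSI system has to push the target speaker's score above both the threshold and every other speaker's score.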
Alice (female)
Mike (male)
Sarah (female)
Amy (female)
Bob (male)
original voice (from a male imposter), decision: reject
adversarial voice 1, decision: Bob, SNR: 31.5 dB
Note: To a human listener, this adversarial voice sounds as if it were uttered by the same speaker as the original voice.
Compared with Bob's enrollment voice, it sounds as if it were uttered by a distinctly different speaker. This makes our attack surreptitious.
However, the OSI system assigns it to Bob (the OSI system misbehaves and FAKEBOB succeeds).
adversarial voice 2, decision: Mike, SNR: 32.4 dB
original voice (from another male imposter), decision: reject
adversarial voice 1, decision: Alice, SNR: 28.4 dB
adversarial voice 2, decision: Sarah, SNR: 29.7 dB
adversarial voice 3, decision: Amy, SNR: 29.7 dB
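The SNR values listed above indicate how small the perturbation is relative to the original voice (a higher SNR means a weaker perturbation). A minimal sketch of one common definition follows, assuming SNR is the ratio of the original signal power to the perturbation power, expressed in decibels. The helper name `snr_db` and the toy samples are illustrative; the exact formula used to produce the numbers above is not stated on this page.

```python
import math
from typing import Sequence


def snr_db(original: Sequence[float], adversarial: Sequence[float]) -> float:
    """Signal-to-noise ratio in dB, treating the perturbation as noise."""
    if len(original) != len(adversarial):
        raise ValueError("both voices must have the same number of samples")
    signal_power = sum(x * x for x in original)
    noise_power = sum((a - x) ** 2 for x, a in zip(original, adversarial))
    if signal_power == 0:
        raise ValueError("original voice must not be all zeros")
    if noise_power == 0:
        return math.inf  # identical voices: no perturbation at all
    return 10.0 * math.log10(signal_power / noise_power)


# Toy example: a perturbation with 1/100 of the signal power.
original = [1.0, -1.0, 1.0, -1.0]
adversarial = [1.1, -1.1, 1.1, -1.1]
print(round(snr_db(original, adversarial), 2))
```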
Unlike OSI systems, a close-set identification (CSI) system never rejects an input voice, i.e., an input voice is always classified as one of the enrolled speakers in G.
original voice, decision: Sarah (female)
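The difference from OSI can be sketched in a few lines: without a reject option, the decision reduces to picking the highest-scoring enrolled speaker. This is an illustration under the assumption that the system exposes one score per enrolled speaker; `csi_decide` and the score values are hypothetical.

```python
from typing import Dict


def csi_decide(scores: Dict[str, float]) -> str:
    """Always return the enrolled speaker with the highest score."""
    return max(scores, key=scores.get)


# Even when every score is low, some enrolled speaker is returned.
print(csi_decide({"Alice": -3.0, "Mike": -7.5, "Sarah": -4.2}))  # Alice
```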
7 adversarial voices under different $\epsilon$:
adversarial voice 1, $\epsilon=0.05$, SNR: 24.4 dB, decision: Alice
adversarial voice 2, $\epsilon=0.01$, SNR: 24.6 dB, decision: Alice
adversarial voice 3, $\epsilon=0.005$, SNR: 21.8 dB, decision: Alice
adversarial voice 4, $\epsilon=0.004$, SNR: 24.7 dB, decision: Alice
adversarial voice 5, $\epsilon=0.003$, SNR: 23.8 dB, decision: Alice
adversarial voice 6, $\epsilon=0.002$, SNR: 27.9 dB, decision: Alice
adversarial voice 7, $\epsilon=0.001$, SNR: 33.4 dB, decision: Alice
original voice, decision: Amy (female)
7 adversarial voices under different $\epsilon$:
adversarial voice 1, $\epsilon=0.05$, SNR: 14.1 dB, decision: Bob
adversarial voice 2, $\epsilon=0.01$, SNR: 14.9 dB, decision: Bob
adversarial voice 3, $\epsilon=0.005$, SNR: 20.0 dB, decision: Bob
adversarial voice 4, $\epsilon=0.004$, SNR: 21.8 dB, decision: Bob
adversarial voice 5, $\epsilon=0.003$, SNR: 24.5 dB, decision: Bob
adversarial voice 6, $\epsilon=0.002$, SNR: 27.6 dB, decision: Bob
adversarial voice 7, $\epsilon=0.001$, SNR: 36.8 dB, decision: Bob
Unlike OSI and CSI systems, a speaker verification (SV) system has only one enrolled speaker. Given an input voice, an SV system checks whether the voice was uttered by the enrolled speaker, i.e., it either accepts or rejects the voice.
Consider an SV system enrolled with a speaker named Bob (male):
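As a rough sketch, the accept/reject decision can be viewed as a single threshold test on the score of the enrolled speaker. The thresholding rule, the function name `sv_decide`, and the numbers below are assumptions made for illustration only.

```python
def sv_decide(score: float, threshold: float) -> str:
    """Accept the voice if its score for the enrolled speaker reaches the threshold."""
    return "accept" if score >= threshold else "reject"


print(sv_decide(score=0.4, threshold=1.0))  # reject
print(sv_decide(score=1.3, threshold=1.0))  # accept
```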
original voice (from a male imposter), decision: reject
adversarial voice, decision: accept, SNR: 27.5 dB
original voice (from a female imposter), decision: reject
adversarial voice, decision: accept, SNR: 37.1 dB
In this scenario, to launch an attack, the adversary has to play adversarial voices to speaker recognition systems via loudspeakers. The adversarial voices are transmitted over the air and finally received by the systems' built-in recorders. The environmental and electronic noise introduced along the way can disrupt the perturbation in the adversarial voice. To address this, we set a relatively large $\epsilon$ ($\epsilon=0.05$ or $\epsilon=0.1$) and increase the adversarial strength (i.e., confidence) of the adversarial samples ($\kappa>0$).
We acknowledge that in the over-the-air attack scenario, the adversarial voices generated by FAKEBOB are noisier than those in the API attack scenario. However, according to our human study, it is still hard for listeners to tell apart the speaker identities of the original and adversarial voices.
We enrolled five speakers (i.e., Alice, Mike, Sarah, Amy, and Bob) to build an OSI system on Microsoft Azure via the HTTP REST API that the Microsoft Azure Speaker Recognition platform provides. Note: our attack on Microsoft Azure is both an over-the-air attack and a transferability attack.
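To make the roles of $\epsilon$ and $\kappa$ more concrete, here is a minimal sketch under stated assumptions: $\epsilon$ is treated as a per-sample ($L_\infty$) bound on the perturbation of a waveform normalized to $[-1, 1]$, and $\kappa$ as the margin by which the target speaker's score must exceed both the rejection threshold and every other speaker's score. The function names and the exact form of the margin check are illustrative and are not FAKEBOB's implementation.

```python
from typing import Dict, List, Sequence


def clip_to_epsilon(
    original: Sequence[float], candidate: Sequence[float], epsilon: float
) -> List[float]:
    """Keep each sample within epsilon of the original and inside [-1, 1]."""
    clipped = []
    for x, c in zip(original, candidate):
        c = min(max(c, x - epsilon), x + epsilon)  # perturbation bound
        c = min(max(c, -1.0), 1.0)                 # valid waveform range (assumed)
        clipped.append(c)
    return clipped


def is_confident(
    scores: Dict[str, float], target: str, threshold: float, kappa: float
) -> bool:
    """True if the target beats the threshold and all rivals by at least kappa."""
    rivals = [s for name, s in scores.items() if name != target]
    strongest_rival = max([threshold] + rivals)
    return scores[target] - strongest_rival >= kappa


print(clip_to_epsilon([0.0, 0.5, 0.99], [0.2, 0.52, 1.2], epsilon=0.05))
print(is_confident({"Bob": 2.0, "Mike": 1.0}, "Bob", threshold=1.2, kappa=0.5))
```

A larger $\epsilon$ leaves more room for the perturbation, and a larger $\kappa$ demands a wider margin, so the adversarial voice is more likely to keep its decision after noise is added during transmission, at the cost of being noisier.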
original voice 01 (from a male imposter), decision: reject
adversarial voice 01, decision: Bob (intra-gender attack, $\epsilon=0.1$)
original voice 02 (from a male imposter), decision: reject
adversarial voice 02, decision: Alice (inter-gender attack, $\epsilon=0.05$)
Our source code is available on GitHub (top) and Gitee (bottom).
To demonstrate the imperceptibility of the adversarial samples, we conduct a human study on the Amazon Mechanical Turk (MTurk) platform.
Specifically, we recruit participants from MTurk and ask them to choose one of two tasks (i.e., "Clean or Noisy" and "Identify the Speaker") and complete the corresponding questionnaire.
The voices used in these two tasks are shown below.
All of these audio files can be downloaded in *.zip format from the URL below:
https://drive.google.com/open?id=1ee136TS6TImgKv7uFIyaJdpvA1bLX-vY
All three of these voices are silent. Only questionnaires that select "Clean" for all three voices are considered valid.
One pair in a row. The two voices of a pair are original voices from the same speaker but with different text.
One pair in a row. The two voices of a pair are original voices from different speakers.
One pair in a row. One voice is an original voice and the other is an adversarial voice crafted from another original voice of the same speaker.
One pair in a row. One voice is an original voice and the other is an adversarial voice crafted from another original voice of the same speaker.
One pair in a row. One voice is from a female speaker and the other is from a male speaker.
Only questionnaires that select "different" for all of these pairs are considered valid.
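The validity checks described for the two tasks (all silent voices marked "Clean", all female/male pairs marked "different") can be sketched as a simple filter over the control items. The questionnaire layout, the item identifiers, and the function name below are hypothetical; they only illustrate the filtering rule.

```python
from typing import Dict, List


def is_valid(answers: Dict[str, str], control_items: List[str], expected: str) -> bool:
    """A questionnaire is valid only if every control item has the expected answer."""
    return all(answers.get(item) == expected for item in control_items)


# Hypothetical questionnaire for the speaker-identification task.
controls = ["pair_fm_1", "pair_fm_2"]  # female/male control pairs
careful = {"pair_fm_1": "different", "pair_fm_2": "different", "pair_adv_1": "same"}
careless = {"pair_fm_1": "different", "pair_fm_2": "same", "pair_adv_1": "same"}

print(is_valid(careful, controls, expected="different"))   # True
print(is_valid(careless, controls, expected="different"))  # False
```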