FakeBob

Introduction

With the development of machine learning techniques, speaker recognition (SR) has been widely deployed in our daily life as a convenient and secure way to recognize our identity. Recently, these machine learning techniques have been demonstrated to be vulnerable to adversarial samples, which are crafted by adding imperceptible perturbations to benign samples. Although the success of adversarial attacks on image recognition systems has been ported to speech recognition systems, little is known about the security implications of SR systems (SRSs) under such attacks. Hence, we propose FAKEBOB, a systematic and practical attack on SRSs. Specifically, we consider API and over-the-air attacks on three different recognition tasks, as follows.
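For readers unfamiliar with adversarial samples, the sketch below illustrates the general idea in the speaker recognition setting: an adversarial voice is the original voice plus a perturbation bounded by $\epsilon$ (under the $L_\infty$ norm), found by repeatedly querying the target system and estimating a gradient from its scores. This is only an illustrative sketch, not FAKEBOB itself; the function names and parameter values below are hypothetical, and the full algorithm is described in our paper and source code.

import numpy as np

def craft_adversarial(x, score_fn, epsilon=0.002, step=0.0005,
                      n_iters=100, n_samples=50, sigma=0.001, seed=0):
    # Iteratively perturb waveform `x` (float samples in [-1, 1]) so that the
    # black-box score `score_fn` of the target speaker increases, while keeping
    # the perturbation within an L-infinity ball of radius `epsilon`.
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=np.float64)
    x_adv = x.copy()
    for _ in range(n_iters):
        # Gradient estimate from random Gaussian directions (NES-style).
        grad = np.zeros_like(x_adv)
        for _ in range(n_samples):
            u = rng.standard_normal(x_adv.shape)
            grad += (score_fn(x_adv + sigma * u) - score_fn(x_adv - sigma * u)) * u
        grad /= 2.0 * sigma * n_samples
        # Signed gradient ascent step, then project back into the epsilon-ball
        # around the original voice and into the valid waveform range.
        x_adv = np.clip(x_adv + step * np.sign(grad), x - epsilon, x + epsilon)
        x_adv = np.clip(x_adv, -1.0, 1.0)
    return x_adv

# Toy usage with a stand-in scoring function (a real attack queries an SRS).
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = 0.01 * rng.standard_normal(16000)            # 1 second of audio at 16 kHz
    score_fn = lambda v: -float(np.mean(v ** 2))     # hypothetical target score
    x_adv = craft_adversarial(x, score_fn)
    print("max perturbation:", np.max(np.abs(x_adv - x)))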

Below is a one-minute video preview of our work.

API Attack

In this scenario, the adversary feeds adversarial voices directly into speaker recognition systems via the systems' API.

Attack on Open-set Identification (OSI)

An OSI system allows multiple speakers to be enrolled during the enrollment phase, forming a speaker group G. Given an arbitrary input voice, the system determines which speaker in G utters the voice, or decides that none of them does (called reject).
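Conceptually, an OSI system scores the input voice against every enrolled speaker and rejects when even the highest score falls below a preset threshold. A minimal sketch of this decision rule is given below (the threshold value and names are hypothetical; real systems differ in their scoring back-ends):

# Illustrative OSI decision rule: pick the best-scoring enrolled speaker,
# but reject if even that score is below the threshold `theta`.
def osi_decision(scores, theta):
    best = max(scores, key=scores.get)
    return best if scores[best] >= theta else "reject"

print(osi_decision({"Alice": 0.4, "Mike": 0.7, "Bob": 0.9}, theta=0.6))  # Bob
print(osi_decision({"Alice": 0.2, "Mike": 0.3, "Bob": 0.1}, theta=0.6))  # reject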

The five speakers in G and their corresponding enrollment voices are shown as follows.

Alice (female)

Mike (male)

Sarah (female)

Amy (female)

Bob (male)

Case 1: Intra-gender Attack ($\epsilon=0.002$, $\kappa=0$)

original voice (from a male imposter), decision: reject

adversarial voice 1, decision: Bob, SNR: 31.5 dB.

Note: To human listeners, this adversarial voice sounds as if it were uttered by the same speaker as the original voice.

Compared with Bob's enrollment voice, it sounds like a distinctly different speaker. This makes our attack surreptitious.

However, the OSI system assigns it to Bob (the OSI system misbehaves; FAKEBOB succeeds).

adversarial voice 2, decision: Mike, SNR: 32.4 dB

Case 2: Inter-gender Attack ($\epsilon=0.002$, $\kappa=0$)

original voice (from another male imposter), decision: reject

adversarial voice 1, decision: Alice, SNR: 28.4 dB

adversarial voice 2, decision: Sarah, SNR: 29.7 dB

adversarial voice 3, decision: Amy, SNR: 29.7 dB
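The SNR values reported above (and in the remaining demos) quantify how large the adversarial perturbation is relative to the original voice; a larger SNR means a less audible perturbation. Below is a minimal sketch of one common way to compute it, assuming the perturbation is simply the difference between the adversarial and original waveforms:

import numpy as np

def snr_db(original, adversarial):
    # SNR in dB: power of the original signal over power of the added perturbation.
    original = np.asarray(original, dtype=np.float64)
    perturbation = np.asarray(adversarial, dtype=np.float64) - original
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(perturbation ** 2))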

Attack on Close-set Identification (CSI)

Unlike OSI systems, a CSI system never rejects an input voice, i.e., an input voice is always classified as one of the enrolled speakers in G.
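In other words, a CSI system always returns the best-matching enrolled speaker and has no rejection threshold. A minimal illustrative sketch (names are hypothetical):

# Illustrative CSI decision rule: always return the best-scoring enrolled speaker.
def csi_decision(scores):
    return max(scores, key=scores.get)

print(csi_decision({"Alice": 0.2, "Sarah": 0.5, "Amy": 0.3}))  # Sarah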

Case 1: Intra-gender Attack ($\kappa=0$)

original voice, decision: Sarah (female)

7 adversarial voices under different $\epsilon$:

adversarial voice 1, $\epsilon=0.05$, SNR: 24.4 dB, decision: Alice

adversarial voice 2, $\epsilon=0.01$, SNR: 24.6 dB, decision: Alice

adversarial voice 3, $\epsilon=0.005$, SNR: 21.8 dB, decision: Alice

adversarial voice 4, $\epsilon=0.004$, SNR: 24.7 dB, decision: Alice

adversarial voice 5, $\epsilon=0.003$, SNR: 23.8 dB, decision: Alice

adversarial voice 6, $\epsilon=0.002$, SNR: 27.9 dB, decision: Alice

adversarial voice 7, $\epsilon=0.001$, SNR: 33.4 dB, decision: Alice

Case 2: Inter-gender Attack ($\kappa=0$)

original voice, decision: Amy (female)

7 adversarial voices under different $\epsilon$:

adversarial voice 1, $\epsilon=0.05$, SNR: 14.1 dB, decision: Bob

adversarial voice 2, $\epsilon=0.01$, SNR: 14.9 dB, decision: Bob

adversarial voice 3, $\epsilon=0.005$, SNR: 20.0 dB, decision: Bob

adversarial voice 4, $\epsilon=0.004$, SNR: 21.8 dB, decision: Bob

adversarial voice 5, $\epsilon=0.003$, SNR: 24.5 dB, decision: Bob

adversarial voice 6, $\epsilon=0.002$, SNR: 27.6 dB, decision: Bob

adversarial voice 7, $\epsilon=0.001$, SNR: 36.8 dB, decision: Bob

Attack on Speaker Verification (SV)

Unlike OSI and CSI systems, an SV system has only one enrolled speaker. Given an input voice, an SV system checks whether the voice is uttered by the enrolled speaker, i.e., it either accepts or rejects the voice.
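Conceptually, an SV system compares the score of the input voice against a single threshold. A minimal illustrative sketch (the threshold and names are hypothetical):

# Illustrative SV decision rule: accept iff the score against the single
# enrolled speaker reaches the threshold `theta`.
def sv_decision(score, theta):
    return "accept" if score >= theta else "reject"

print(sv_decision(0.8, theta=0.6))  # accept
print(sv_decision(0.3, theta=0.6))  # reject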

Consider an SV system enrolled with the speaker named Bob (male):

Case 1: Intra-gender Attack ($\epsilon=0.002$, $\kappa=0$)

original voice (from a male imposter), decision: reject

adversarial voice, decision: accept, SNR: 27.5 dB

Case 2: Inter-gender Attack ($\epsilon=0.002$, $\kappa=0$)

original voice (from a female imposter), decision: reject

adversarial voice, decision: accept, SNR: 37.1 dB

Over-the-air Attack

In this scenario, to launch an attack, the adversary has to play adversarial voices towards speaker recognition systems via loudspeakers. The adversarial voices are transmitted over the air and finally received by the systems' built-in recorders. The environmental and electronic noise introduced during this process disrupts the perturbation in the adversarial voices. To address this, we set a relatively larger $\epsilon$ ($\epsilon=0.05$ or $\epsilon=0.1$) and increase the adversarial strength (i.e., confidence) of the adversarial samples ($\kappa>0$).
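Intuitively, $\kappa$ requires the target speaker's score to beat both the rejection threshold and all other enrolled speakers' scores by a margin of at least $\kappa$, so that the adversarial voice still succeeds even after part of the perturbation is corrupted during playback and recording. One way to express this kind of confidence-margin loss for the OSI case (a sketch; the exact formulation we use is given in our paper) is

$\mathcal{L}(x) = \max\big\{\max\big(\theta,\ \max_{i \neq t} S_i(x)\big) - S_t(x),\ -\kappa\big\}$,

where $S_i(x)$ is the system's score of voice $x$ for enrolled speaker $i$, $t$ is the target speaker, and $\theta$ is the rejection threshold; the attack perturbs $x$ until $\mathcal{L}(x)$ reaches $-\kappa$.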

We admit that in the over-the-air attack scenario, the adversarial voices generated by our attack FAKEBOB are noisier than those in the API attack scenario. However, according to our human study, it is still hard for humans to notice any difference in speaker identity between the original and adversarial voices.

Attack on the Microsoft Azure Speaker Recognition Platform

We enrolled five speakers (i.e., Alice, Mike, Sarah, Amy, Bob) to build an OSI system on Microsoft Azure via the HTTP REST API provided by the Microsoft Azure Speaker Recognition Platform. Note: our attack on Microsoft Azure is both an over-the-air attack and a transferability attack.

original voice 01 (from a male imposter), decision: reject

adversarial voice 01, decision: Bob (intra-gender attack, $\epsilon=0.1$)

original voice 02 (from a male imposter), decision: reject

adversarial voice 02, decision: Alice (inter-gender attack, $\epsilon=0.05$)

Source Code

Our source code is available on GitHub and Gitee.

Human Study Resources

To demonstrate the imperceptibility of our adversarial samples, we conduct a human study on the Amazon Mechanical Turk (MTurk) platform.

Specifically, we recruit participants from MTurk and ask them to choose one of two tasks (i.e., "Clean or Noisy" or "Identify the speaker") and complete the corresponding questionnaire.

The voices we used in these two tasks are shown below.

All of these audio files can be downloaded in *.zip format from the URL below:

https://drive.google.com/open?id=1ee136TS6TImgKv7uFIyaJdpvA1bLX-vY

Task 1: Clean or Noisy

Original voices

Adversarial voices that are only effective in the API attack

Adversarial voices which remain effective when played over the air

Voices used for concentration testing (silent voices)

All three of these voices are silent. Only questionnaires in which the participant selects "Clean" for all three of these voices are considered valid.

Task 2: Identify the speaker

Original pair

One pair per row. The two voices in a pair are original voices from the same speaker but with different text.

Other pair

One pair per row. The two voices in a pair are original voices from different speakers.

Adversarial (non air) pair

One pair per row. One voice is an original voice and the other is an adversarial voice crafted from another original voice of the same speaker.

Adversarial (air) pair

One pair per row. One voice is an original voice and the other is an adversarial voice, which remains effective when played over the air, crafted from another original voice of the same speaker.

Voice pair used for concentration testing

One pair per row. One voice is from a female speaker and the other is from a male speaker.

Only questionnaires in which the participant selects "different" for all of these pairs are considered valid.