AntiFake: Using Adversarial Audio to Prevent Unauthorized Speech Synthesis

Zhiyuan Yu, Shixuan Zhai, Ning Zhang
Washington University in St. Louis

Overview

AntiFake is developed to protect your voice.

The rapid development of deep neural networks and generative AI has catalyzed growth in realistic speech synthesis. While this technology has great potential to improve lives, it also enables audio DeepFakes, in which synthesized speech is misused to deceive humans and machines for nefarious purposes. In response to this evolving threat, there has been significant interest in DeepFake detection. Complementary to existing work, we take a preventative approach and introduce AntiFake, a defense mechanism that relies on "adversarial" perturbations to prevent unauthorized speech synthesis. By optimizing and applying carefully crafted noise to the original audio, the processed speech sample still sounds like the original speaker to humans; when it is used for speech synthesis, however, the resulting synthetic speech resembles other voices. Consequently, the generated DeepFake audio is less likely to deceive humans or machines for nefarious purposes.
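In a simplified formulation (our own notation here, not verbatim from the paper), this amounts to a constrained optimization over a perturbation delta added to the original speech x, where E denotes the attacker's speaker encoder:

  \max_{\delta} \; d\big(E(x + \delta),\, E(x)\big) \quad \text{s.t.} \quad \|\delta\|_{\infty} \le \epsilon

Here d is a distance between speaker embeddings (e.g., one minus cosine similarity), and the bound \epsilon is a simplified stand-in for the requirement that the protected sample still sounds like the original speaker to human listeners. Equivalently, one can minimize the distance to another speaker's embedding so that the synthesized output resembles a different voice.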

Why Do We Need AntiFake?

Unauthorized speech synthesis can cause serious harm.

On the one hand, speech synthesis has great potential to improve our lives...

Voice Service · Chatbot · Broadcast · Voice Assistant

However, it can also be used for nefarious purposes. Such tools have reportedly been used to conduct financial scams, spread misinformation, bypass speech-based authentication systems, and profit commercially by infringing on voice actors' rights. ☹️

The following examples show the astonishing performance of contemporary synthesizers.

Speaker      Source        Synthesized
Speaker 1    source.wav    source_el.mp3
Speaker 2    source.wav    source_el.mp3
Speaker 3    source.wav    source_el.mp3

Can we proactively protect our voices by minimally altering the voice samples available to attackers?

How Does AntiFake Work?

AntiFake optimizes subtle noise that disrupts synthesizers.

Suppose you are a user who wants to protect your voice...


Technically, the key to making synthetic speech resemble a particular human voice lies in the speaker embedding, which is generally extracted by neural-network-based encoders.
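For illustration (using SpeechBrain's publicly available ECAPA-TDNN speaker-verification model purely as a stand-in, not necessarily an encoder in our pipeline), extracting such an embedding looks like this:

# Illustration: extract a speaker embedding with a public pretrained encoder.
# "my_voice.wav" is a placeholder; this model expects 16 kHz mono audio.
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
waveform, sample_rate = torchaudio.load("my_voice.wav")
embedding = encoder.encode_batch(waveform)  # a compact vector characterizing the voice
print(embedding.shape)

Recordings of the same speaker map to nearby points in this embedding space, which is what a synthesizer relies on to clone a voice.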

Since we aim to disrupt the synthesis process, we need to shift this embedding in the attacker's model away from the original speaker.
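A minimal sketch of this step is shown below, assuming a differentiable speaker_encoder and a target_embedding of a different voice (both placeholders, not part of our released code): a small, bounded perturbation is optimized with signed-gradient (PGD-style) steps so that the embedding of the protected audio drifts toward the target speaker.

# Sketch only: optimize a bounded perturbation that shifts the speaker embedding.
# speaker_encoder and target_embedding are hypothetical placeholders.
import torch
import torch.nn.functional as F

def protect(waveform, speaker_encoder, target_embedding,
            epsilon=0.002, step_size=2e-4, steps=500):
    """Return a perturbed waveform whose embedding moves toward target_embedding."""
    delta = torch.zeros_like(waveform, requires_grad=True)
    for _ in range(steps):
        emb = speaker_encoder(waveform + delta)           # embedding of the perturbed audio
        loss = 1.0 - F.cosine_similarity(emb, target_embedding, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()        # signed-gradient descent step
            delta.clamp_(-epsilon, epsilon)               # keep the change hard to notice
        delta.grad.zero_()
    return (waveform + delta).detach()

Bounding the perturbation keeps the protected clip sounding like you, while the shifted embedding steers any synthesizer that consumes it toward a different voice; the optimization used in AntiFake is more involved than this sketch.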





For more details of our approach, please see our paper.







Demo Audio

We applied AntiFake to both existing speech corpora and samples collected from participants to test its efficacy.

(Note: embedded links with the format "us?export" have not worked on Google Sites since January 2024; see similar issues here and here. We are waiting to see whether they recover before making significant changes to the page. Please reach out if you need these demo audio clips.)

LibriSpeech Speakers

Human Testers

If you find this work helpful, please cite our paper:


@inproceedings{yu2023antifake,
  title={AntiFake: Using Adversarial Audio to Prevent Unauthorized Speech Synthesis},
  author={Yu, Zhiyuan and Zhai, Shixuan and Zhang, Ning},
  booktitle={Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security},
  year={2023}
}



Our source code is available at:

Questions?

For more details, please see our ACM CCS 2023 paper, or contact us.