AudioShield

Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems

(Accepted to USENIX Security '25)

Overview

The widespread application of automatic speech recognition (ASR) enables large-scale voice surveillance, raising privacy concerns among users. In this paper, we use adversarial examples to mitigate the unauthorized disclosure of speech content to potential eavesdroppers in speech communications. While audio adversarial examples have demonstrated the capability to mislead ASR models or evade ASR surveillance, they are typically constructed through time-intensive offline optimization, which restricts their practicality in real-time voice communication. Recent work overcame this limitation by generating universal adversarial perturbations (UAPs) and enhancing their transferability for black-box scenarios. However, these methods introduce excessive noise that significantly degrades audio quality and affects human perception, limiting their effectiveness in practical scenarios. To address this limitation and protect users' live speech against ASR systems, we propose a novel framework, AudioShield. Central to this framework is the concept of Transferable Universal Adversarial Perturbations in the Latent Space (LS-TUAP). By moving the perturbations to the latent space, the audio quality is preserved to a large extent. Additionally, we propose target feature adaptation, which enhances the transferability of UAPs by embedding target text features into the perturbations. Comprehensive evaluation on four commercial ASR APIs (Google, Amazon, iFlytek, and Alibaba), two LLM-powered ASR systems, and one NN-based ASR system demonstrates that AudioShield offers stronger protection than existing competitors, and both objective and subjective evaluations indicate that AudioShield significantly improves audio quality. Moreover, AudioShield is highly effective in the over-the-air scenario against three widely used voice assistants and demonstrates strong resilience against adaptive countermeasures.
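To make the term "universal adversarial perturbation" concrete, the following is a minimal sketch of the waveform-domain idea that the overview attributes to prior work: one fixed perturbation is precomputed offline and then simply added to any incoming utterance, so no per-input optimization is needed at call time. It is not AudioShield's latent-space method. The function name, the toy perturbation values, and the bound `eps` are all hypothetical and chosen only for illustration.

```python
from typing import List


def apply_universal_perturbation(
    wave: List[float], delta: List[float], eps: float
) -> List[float]:
    """Add one fixed perturbation to a waveform (toy illustration only).

    `delta` is tiled to the length of `wave`, each perturbation sample is
    clipped to [-eps, eps], and the result is kept in the valid range [-1, 1].
    """
    protected = []
    for i, sample in enumerate(wave):
        d = max(-eps, min(eps, delta[i % len(delta)]))  # bound the noise
        protected.append(max(-1.0, min(1.0, sample + d)))  # stay in range
    return protected


# Hypothetical precomputed perturbation; a real one would be optimized offline.
delta = [0.02, -0.03, 0.01, -0.01]

# The same delta is reused for utterances of different lengths.
utterance_a = [0.0, 0.5, -0.5, 0.25, 0.1]
utterance_b = [0.9, -0.9]
protected_a = apply_universal_perturbation(utterance_a, delta, eps=0.02)
protected_b = apply_universal_perturbation(utterance_b, delta, eps=0.02)
```

Because the perturbation lives directly in the waveform here, a larger `eps` is audible as added noise, which is the quality problem the overview says motivates perturbing a latent representation instead.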

Paper

Code

What is AudioShield used for?

As shown in the figure, large-scale surveillance of speech communication typically involves three parties: the speaker, the receiver, and the eavesdropper. While the speaker conveys speech to the receiver, the eavesdropper can intercept large amounts of user speech data and use ASR to convert it into text for quick extraction of key information. Worse, because the speech is unprotected, the speaker and receiver may never learn that their conversations are being monitored. In summary, such unprotected conversations make it possible for private content to leak to third parties.

AudioShield converts each normal speech input into an adversarial example so that ASR systems transcribe its semantic content incorrectly, thereby preventing large-scale surveillance and protecting the privacy of users' speech.

Demo Audio Clips

Original Text: I've not said anything to them, they know.

Neekhara et al.

NEEK_1.wav

Transcription: I've not said anything to them they know

Zong et al.

ZONG_1.wav

Transcription: Has not said anything to them they know

AdvDDoS

ADV_1.wav

Transcription: I've not said anything to them they know

Ours

SUAP_1.wav

Transcription: No I don't know who had anything to do with

Original Text: One season, they might do well.

Neekhara et al.

NEEK_2.wav

Transcription: 1 season they might do well

Zong et al.

ZONG_2.wav

Transcription: They might be well

AdvDDoS

ADV_2.wav

Transcription: 1 piece and you might be well

Ours

SUAP_2.wav

Transcription: Most of the time
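One way to compare the transcriptions listed above is word error rate (WER), a standard ASR metric: the word-level edit distance between the original text and the transcription, divided by the number of words in the original. This page does not say how the demo clips were scored, so the sketch below is only an illustration; the helper names and the simple tokenizer are assumptions. A higher WER means the ASR output is further from what was said, which is the goal here.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and keep word characters and apostrophes (assumed normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance using a rolling one-row table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of a transcription against the original text."""
    ref = tokenize(reference)
    return edit_distance(ref, tokenize(hypothesis)) / len(ref)


original = "I've not said anything to them, they know."
transcriptions = {
    "Neekhara et al.": "I've not said anything to them they know",
    "Zong et al.": "Has not said anything to them they know",
    "AdvDDoS": "I've not said anything to them they know",
    "Ours": "No I don't know who had anything to do with",
}
scores = {name: wer(original, text) for name, text in transcriptions.items()}
```

On the first demo sentence, the transcriptions that match the original word for word score 0, while the "Ours" transcription scores highest of the four. Note that this simple tokenizer does not normalize numerals, so "1 season" versus "One season" in the second example would count as one substitution.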

Visualization
