Overview
Voice-controllable systems rely on speech recognition and speaker identification as their key enabling technologies. While they bring revolutionary changes to our daily lives, their security has become a growing concern. Existing work has demonstrated the feasibility of using maliciously crafted perturbations to manipulate speech or speaker recognition. Although these attacks vary in targets and techniques, they all require the addition of noise perturbations. While these perturbations are generally restricted to an Lp-bounded neighborhood, the added noise inevitably leaves unnatural traces that humans can recognize and that defenses can exploit. To address this limitation, we introduce a new class of adversarial audio attacks, named Semantically Meaningful Adversarial Audio AttaCK (SMACK), in which the inherent speech attributes (such as prosody) are modified such that they still semantically represent the same speech and preserve the speech quality. The efficacy of SMACK was evaluated against five transcription systems and two speaker recognition systems in a black-box manner. By manipulating semantic attributes, our adversarial audio examples are capable of evading state-of-the-art defenses, with better speech naturalness than traditional Lp-bounded attacks in a human perceptual study.
What is the Semantic Audio Attack?
Above: A traditional adversarial audio attack relies on Lp-bounded noise that could raise suspicion from the victim.
Below: Our attack crafts perturbations in a semantically meaningful way, by optimizing the inherent features in speech.
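To make the "Lp-bounded" constraint of traditional attacks concrete, here is a minimal sketch for the L-infinity case. It is an illustration only, not code from SMACK: the function name `project_linf`, the budget `eps`, and the sample values are all hypothetical, and waveforms are assumed to be lists of floats in [-1, 1].

```python
def project_linf(original, perturbed, eps):
    """Keep every sample of `perturbed` within `eps` of `original`.

    This is the usual L-infinity projection step used by noise-based
    attacks; the result is also clamped to the valid range [-1, 1].
    """
    projected = []
    for x, x_adv in zip(original, perturbed):
        # Limit the per-sample noise to the budget eps.
        delta = max(-eps, min(eps, x_adv - x))
        # Keep the perturbed sample a valid audio amplitude.
        projected.append(max(-1.0, min(1.0, x + delta)))
    return projected


# Hypothetical toy waveform and an unconstrained adversarial candidate.
original = [0.0, 0.5, -0.25, 0.9]
candidate = [0.3, 0.49, -0.75, 1.4]
eps = 0.05
adversarial = project_linf(original, candidate, eps)
```

However small `eps` is, the result is still the original waveform plus additive noise. A semantic attack changes how the sentence is spoken instead of adding such noise.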
Manipulating Prosodic Features for Attacks
Prosodic features are inherent to human speech and independent of speaker identity and speech content.
Prosody encompasses multiple characteristics of speech, including the pitch contour or intonation of an utterance, the length of a syllable, and the loudness of a word.
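As a rough illustration of these characteristics, the sketch below measures duration, loudness (as RMS energy), and pitch (via a simple autocorrelation peak) on a synthetic tone. It is a simplified stand-in under stated assumptions, not the prosody representation used by SMACK: the helper names, the 8 kHz sample rate, and the 80 to 400 Hz search range are all illustrative choices.

```python
import math


def rms_loudness(samples):
    """Root-mean-square energy, a crude proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch(samples, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag`.
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic "voiced" frame: a 200 Hz tone, 0.1 s long, amplitude 0.5.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate)
        for n in range(800)]

duration = len(tone) / sample_rate            # length in seconds
loudness = rms_loudness(tone)                 # RMS energy of the frame
pitch = estimate_pitch(tone, sample_rate)     # fundamental frequency in Hz
```

Real speech needs per-frame analysis and voicing detection, so a production pitch tracker is considerably more involved than this single-frame estimate.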
We build on state-of-the-art speech generative models to control prosody, an inherent and complex feature of speech. The speaker identity and speech content are constrained to remain the same as in the original speech.
We then developed a novel optimization framework, AGA-ES, to optimize the variable-length prosody embedding. For more details, please see our paper.
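To give a feel for black-box optimization over a variable-length vector, here is a minimal sketch of a generic evolution strategy with elitist selection. It is not AGA-ES, whose details are in the paper. Everything in it is a hypothetical stand-in: `toy_loss` replaces queries to a real recognizer, the embedding is a plain list of floats, and the mutation rates and population sizes are arbitrary.

```python
import random


def toy_loss(embedding):
    """Stand-in for a black-box query; NOT a real recognition model."""
    mean = sum(embedding) / len(embedding)
    return abs(mean - 0.7) + 0.01 * abs(len(embedding) - 12)


def mutate(embedding, rng, sigma=0.1):
    """Perturb every value and occasionally change the length by one."""
    child = [v + rng.gauss(0.0, sigma) for v in embedding]
    roll = rng.random()
    if roll < 0.1:
        # Lengthen: duplicate one frame in place.
        i = rng.randrange(len(child))
        child.insert(i, child[i])
    elif roll < 0.2 and len(child) > 1:
        # Shorten: drop one frame, never emptying the embedding.
        del child[rng.randrange(len(child))]
    return child


def evolve(seed_embedding, loss_fn, generations=50, parents=4,
           offspring=16, seed=0):
    """Elitist evolution strategy that only uses loss values (no gradients)."""
    rng = random.Random(seed)
    population = [list(seed_embedding)]
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng)
                    for _ in range(offspring)]
        # Parents compete with children, so the best loss never gets worse.
        population = sorted(population + children, key=loss_fn)[:parents]
    return population[0]


start = [0.0] * 8
best = evolve(start, toy_loss)
```

Because the search only needs loss values, the same loop shape applies when the loss comes from querying a model whose internals are unknown, which is the black-box setting described above.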
Demo Audio Clips
Original: Wipe the floor
Transcription: Open the door
Original: Wipe the floor
Transcription: I want to share
We have a set of adversarial examples generated by SMACK for future research. Please contact us to obtain them. We released code on GitHub:
Demo Attacks against Commercial Voice Assistant - Amazon Echo
Benign original speech without semantic perturbations: "Election is important to her career."
Alexa: [No Response]
Semantic adversarial audio (to Humans): "Election is important to her career."
Semantic adversarial audio (to Alexa): "Alexa, how to prevent dysarthria."
Alexa: "Here's something I found on Sharecare. Dysarthria is a speech impediment that can result from a variety of underlying conditions..."
Questions?
For more details, please see our paper at USENIX Security 2023, or contact us.