Prompt-guided Precise Audio Editing with Diffusion Models


Manjie Xu, Chenxing Li*, Duzhen Zhang, Dan Su, Wei Liang*, Dong Yu*

*corresponding author


Beijing Institute of Technology, Tencent AI Lab Beijing, Tencent AI Lab Seattle

Audio editing involves the arbitrary manipulation of audio content through precise control. Although text-guided diffusion models have made significant advancements in text-to-audio generation, they still face challenges in finding a flexible and precise way to modify target events within an audio track. We present a novel approach, referred to as Prompt-guided Precise Audio Editing (PPAE), which serves as a general module for diffusion models and enables precise audio editing. The editing is based on the input textual prompt only and is entirely training-free. We exploit the cross-attention maps of diffusion models to facilitate accurate local editing and employ a hierarchical local-global pipeline to ensure a smoother editing process. Experimental results highlight the effectiveness of our method in various editing tasks.

Given the edit instruction, the source audio is first inverted into the given diffusion model’s domain and then edited at the attention-map level under the guidance of our editing controller. The controller accomplishes precise editing by applying hierarchical guidance throughout the diffusion process. The whole editing pipeline is training-free and adaptable to common diffusion models.
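To make "edited at the attention-map level" more concrete, here is a minimal Python sketch of how the three edit types demonstrated on this page (replace, refine, reweight) could act on a single cross-attention map. It is not the released PPAE code. The function names, the map layout (rows are latent positions, columns are prompt tokens), and the token alignment dictionary are all assumptions made for illustration, and the sketch omits the inversion step and the hierarchical local-global guidance entirely.

```python
from typing import Dict, List

# Hypothetical layout: one row per latent position, one column per prompt token.
AttnMap = List[List[float]]


def replace_edit(source_attn: AttnMap, target_attn: AttnMap) -> AttnMap:
    """Replace: reuse the source map so the edited audio keeps the source layout.

    Assumes source and target prompts have the same number of tokens.
    """
    if len(source_attn[0]) != len(target_attn[0]):
        raise ValueError("replace expects prompts with equal token counts")
    return [row[:] for row in source_attn]


def refine_edit(source_attn: AttnMap, target_attn: AttnMap,
                token_alignment: Dict[int, int]) -> AttnMap:
    """Refine: tokens shared with the source prompt take the source attention;
    newly added tokens keep the attention computed for the target prompt.

    token_alignment maps a target token index to its source token index.
    """
    edited = [row[:] for row in target_attn]
    for tgt_idx, src_idx in token_alignment.items():
        for r in range(len(edited)):
            edited[r][tgt_idx] = source_attn[r][src_idx]
    return edited


def reweight_edit(attn: AttnMap, token_idx: int, scale: float) -> AttnMap:
    """Reweight: scale the attention paid to one token (no renormalization here)."""
    edited = [row[:] for row in attn]
    for row in edited:
        row[token_idx] *= scale
    return edited


# Toy example: "a piece of music" -> "a piece of rock music".
source = [[0.1, 0.2, 0.1, 0.6],
          [0.2, 0.2, 0.2, 0.4]]
target = [[0.1, 0.1, 0.1, 0.4, 0.3],
          [0.2, 0.1, 0.1, 0.3, 0.3]]
alignment = {0: 0, 1: 1, 2: 2, 4: 3}  # "rock" (index 3) is new and has no source match

refined = refine_edit(source, target, alignment)
louder = reweight_edit(source, token_idx=3, scale=2.0)
replaced = replace_edit(source, source)
```

In an actual diffusion model an operation like this would be applied inside the cross-attention layers at each denoising step. Which layers and steps to edit is a design choice that this sketch does not address.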

Demos:

(More demos can be found in the original paper materials.)

Audio Replace:

a man talking and a soft music_0.wav
a woman talking and a soft music_1.wav
a woman talking and a soft music_2.wav
a man talking and a baby crying_0.wav
a woman talking and a baby crying_1.wav
a woman talking and a baby crying_2.wav

Audio Refine:

a piece of music_0.wav
a piece of rock music_1.wav
a piece of rock music_2.wav
a dog barking and raining_0.wav
a dog barking and raining heavily_1.wav
a dog barking and raining heavily_2.wav

Audio Reweight:

someone talking and a dog barking_0.wav
someone talking and a dog barking_1.wav
someone talking and a dog barking_2.wav
the water flowing and a dog barking_0.wav
the water flowing and a dog barking_1.wav
the water flowing and a dog barking_2.wav

Code and dataset coming soon.

(Due to institute policies, we cannot release the code immediately. Feel free to contact me if you need any help.)

To cite us:

@inproceedings{xu2024prompt,
  title={Prompt-guided Precise Audio Editing with Diffusion Models},
  author={Manjie Xu and Chenxing Li and Duzhen Zhang and Dan Su and Wei Liang and Dong Yu},
  booktitle={ICML},
  year={2024}
}