Video-to-Audio Generation with Hidden Alignment


Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu

Tencent AI Lab

Generating semantically and temporally aligned audio content from video input has become a focal point for researchers, particularly following the remarkable breakthroughs in text-to-video generation. In this work, we offer insights into the video-to-audio generation paradigm, focusing on three crucial aspects: vision encoders, auxiliary embeddings, and data augmentation techniques. Beginning with a foundational model built on a simple yet surprisingly effective intuition, we explore various vision encoders and auxiliary embeddings through ablation studies. Employing a comprehensive evaluation pipeline that emphasizes generation quality and video-audio synchronization, we demonstrate that our model exhibits state-of-the-art video-to-audio generation capabilities. Furthermore, we provide key insights into the impact of different data augmentation methods on the overall capability of the generation framework, and showcase promising directions for generating audio that is aligned with video both semantically and temporally. We hope these insights will serve as a stepping stone towards more realistic and accurate audio-visual generation models.
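To make the paradigm concrete, here is a minimal, self-contained sketch of the core idea: frame-level vision features (e.g., from a CLIP-style encoder) condition a latent diffusion denoiser over audio latents via cross-attention. This is not the paper's implementation; all module names, dimensions, and the simplified training step below are illustrative assumptions.

```python
# Illustrative sketch only: a tiny video-conditioned denoiser over audio
# latents. Real systems use a VAE over mel spectrograms, a U-Net/Transformer
# denoiser, and a proper noise schedule; everything here is simplified.
import torch
import torch.nn as nn

class VideoConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=64, vision_dim=512, hidden=256, heads=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, hidden)
        self.cond_proj = nn.Linear(vision_dim, hidden)
        # Cross-attention: noisy audio latents attend to per-frame vision
        # features, which is where temporal alignment can emerge.
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.out_proj = nn.Linear(hidden, latent_dim)

    def forward(self, noisy_latents, t, frame_feats):
        # noisy_latents: (B, T_audio, latent_dim)
        # frame_feats:   (B, T_video, vision_dim)
        h = self.in_proj(noisy_latents) + t.view(-1, 1, 1)  # crude timestep injection
        c = self.cond_proj(frame_feats)
        h, _ = self.cross_attn(h, c, c)
        return self.out_proj(h)  # predicted noise

# One simplified diffusion training step: predict the added noise.
model = VideoConditionedDenoiser()
audio_latents = torch.randn(2, 100, 64)  # stand-in for VAE-encoded mel latents
frame_feats = torch.randn(2, 32, 512)    # stand-in for per-frame vision features
t = torch.rand(2)                        # diffusion "time" in [0, 1]
noise = torch.randn_like(audio_latents)
noisy = (1 - t.view(-1, 1, 1)) * audio_latents + t.view(-1, 1, 1) * noise
loss = nn.functional.mse_loss(model(noisy, t, frame_feats), noise)
loss.backward()
```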

Demos: (embedded demo videos; see the repository linked below)

Code & checkpoint:

- Code: https://github.com/ariesssxu/vta-ldm
- Checkpoint: https://huggingface.co/ariesssxu/vta-ldm-clip4clip-v-large
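The released checkpoint can be fetched with the `huggingface_hub` library; the sketch below only downloads the weights (the local directory name is arbitrary). Inference itself is driven by the scripts in the GitHub repository above.

```python
from huggingface_hub import snapshot_download

# Download the full checkpoint repository to a local directory.
snapshot_download(
    repo_id="ariesssxu/vta-ldm-clip4clip-v-large",
    local_dir="ckpt/vta-ldm-clip4clip-v-large",
)
```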

Cite us:

```bibtex
@misc{xu2024vta-ldm,
      title={Video-to-Audio Generation with Hidden Alignment},
      author={Manjie Xu and Chenxing Li and Yong Ren and Rilin Chen and Yu Gu and Wei Liang and Dong Yu},
      year={2024},
      eprint={2407.07464},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2407.07464},
}
```

This is not an officially supported Tencent product.

Connect with us: manjietsu@gmail.com