VIDEO-TO-AUDIO GENERATION WITH HIDDEN ALIGNMENT
Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu
Tencent AI Lab
Generating audio content that is semantically and temporally aligned with video input has become a focal point for researchers, particularly following remarkable breakthroughs in text-to-video generation. In this work, we offer insights into the video-to-audio generation paradigm, focusing on three crucial aspects: vision encoders, auxiliary embeddings, and data augmentation techniques. Beginning with a foundational model built on a simple yet surprisingly effective intuition, we explore various vision encoders and auxiliary embeddings through ablation studies. Using a comprehensive evaluation pipeline that emphasizes generation quality and video-audio synchronization, we demonstrate that our model exhibits state-of-the-art video-to-audio generation capabilities. Furthermore, we provide key insights into how different data augmentation methods enhance the framework's overall capacity, and we showcase possibilities for advancing the challenge of generating audio that is synchronized with video both semantically and temporally. We hope these insights will serve as a stepping stone towards more realistic and accurate audio-visual generation models.
Demos:
Code: https://github.com/ariesssxu/vta-ldm
Checkpoint: https://huggingface.co/ariesssxu/vta-ldm-clip4clip-v-large
Cite us:
@misc{xu2024vta-ldm,
      title={Video-to-Audio Generation with Hidden Alignment},
      author={Manjie Xu and Chenxing Li and Yong Ren and Rilin Chen and Yu Gu and Wei Liang and Dong Yu},
      year={2024},
      eprint={2407.07464},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2407.07464},
}
This is not an officially supported Tencent product.
Connect with us: manjietsu@gmail.com