AVGenL: Audio-Visual Generation and Learning
ECCV 2024 Workshop
Sep. 29 2024, Milano, Italy
Room: Suite 9
14:00 ~ 18:00
Image generated with DALL·E
In recent years, we have witnessed significant advances in visual generation that have shaped the research landscape at computer vision conferences such as ECCV, ICCV, and CVPR. However, in a world where information is conveyed through a rich tapestry of sensory experiences, fusing audio and visual modalities has become essential for understanding and replicating the intricacies of human perception and for diverse real-world applications. Indeed, the integration of audio and visual information has emerged as a critical area of research in computer vision and machine learning, with applications spanning multimedia analysis, virtual reality, advertising, and cinema, from immersive gaming environments to lifelike simulations for medical training.
Despite these strong motivations, research on understanding and generating audio-visual modalities has received little attention compared to traditional, vision-only approaches and applications. Given the recent prominence of multi-modal foundation models, embracing the fusion of audio and visual data is expected to further advance current research efforts and practical applications within the computer vision community, making this workshop a timely addition to ECCV that can catalyze progress in this burgeoning field.
In this workshop, we aim to spotlight this exciting yet under-investigated field by prioritizing new approaches to audio-visual generation, while also covering a wide range of topics in audio-visual learning, where the convergence of auditory and visual signals unlocks a plethora of opportunities for advancing creativity, understanding, and machine perception. We hope the workshop brings together researchers, practitioners, and enthusiasts from diverse disciplines in both academia and industry to discuss the latest developments, challenges, and breakthroughs in audio-visual generation and learning.
Sony and Sony AI are organizing a challenge on sounding video generation. Details can be found on the challenge website, and more information will be released there soon.
The workshop will mainly cover the topics presented below.
Audio-visual generation (such as audio-visual mutual/conditional generation, sounding video and talking head, etc.)
Audio-visual foundation model
Audio-visual representation learning and transfer learning
Audio-visual learning applications (scene understanding, localization, etc.)
Ethical considerations in audio-visual research
We invite two types of submissions: 1) workshop papers, which should follow the anonymous ECCV template (ideally around 8 pages, though this is not mandatory) and must not exceed 14 pages (excluding references); 2) extended abstracts, which should be shorter than the equivalent of 4 pages (excluding references) in the CVPR template format and will not be considered publications. The reviewing process is double-blind.
Submission site: openreview link
Timeline:
Paper Submission Deadline: Jul. 15, 2024, 2:00 pm CET (UTC+2)
Paper Notification to Authors: Aug. 11, 2024 (extended from Aug. 9, 2024)
Paper Camera-Ready Deadline: Aug. 16, 2024
Tentative schedule (timezone: Central European Time)
14:00 - 14:10 Opening remarks, welcome
14:10 - 14:45 Invited talk 1: Prof. Sangpil Kim [slides]
14:45 - 15:20 Invited talk 2: Prof. Chuang Gan
15:20 - 16:05 Poster Session and Coffee Break
16:05 - 16:40 Invited talk 3: Prof. Andrew Owens
16:40 - 17:15 Invited talk 4: Dr. Konstantinos Vougioukas (virtually) [slides]
17:15 - 17:50 Invited talk 5: Prof. Chenliang Xu [slides]
17:50 - 18:00 Closing remarks
(Full paper) AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition. Andrew Rouditchenko, Ronan Collobert, Tatiana Likhomanenko
(Extended abstract) A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation. Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
(Full paper) FastTalker: Jointly Generating Speech and Conversational Gestures from Text. Zixin Guo, Zhang Jian
(Full paper) Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification. Mahrukh Awan, Asmar Nadeem, Muhammad Junaid Awan, Armin Mustafa, Syed Sameed Husain
(Full paper) CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling. Ruihan Yang, Hannes Gamper, Sebastian Braun
(Full paper) Unveiling Visual Biases in Audio-Visual Localization Benchmarks. Liangyu Chen, Zihao Yue, Boshen Xu, Qin Jin
(CVPR24) Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners. Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen
(ECCV24) Audio-Synchronized Visual Animation. Lin Zhang, Shentong Mo, Yijing Zhang, Pedro Morgado
(ECCV24) Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity. Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà
(ECCV24) Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time. Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha
(ECCV24) TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting. Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, Lin Gu
(ECCV24) Modeling and Driving Human Body Soundfields through Acoustic Primitives. Chao Huang, Dejan Markovic, Chenliang Xu, Alexander Richard