Text, Camera, Action!
Frontiers in Controllable Video Generation
Workshop @ ICML 2024, Vienna
About us 💃🕺
The first Controllable Video Generation workshop (CVG) will be held at ICML 2024 in Vienna, Austria.
The workshop focuses on exploring various modes of control in video generation: from specifying the content of a video with text 📄, to viewing a scene from different camera angles 📷, or even directing the actions of characters within the video 🏃.
Our aim is to showcase these diverse approaches and their applications, highlighting the latest advancements and exploring future directions in the field of video generation.
Speakers 🎙️
Andreas Blattmann
Stability AI
Tali Dekel
Weizmann Institute of Science/Google DeepMind
Sander Dieleman
Google DeepMind
Ashley Edwards
RunwayML
Boyi Li
Berkeley/NVIDIA
William (Bill) Peebles
OpenAI
Schedule
09:00-09:05 Introduction and Opening Remarks
09:05-09:40 Andreas Blattmann (Stability AI)
09:40-10:00 Oral presentation
Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices
10:00-10:30 Coffee break
10:30-11:10 Ashley Edwards (RunwayML)
Learning actions, policies, rewards, and environments from videos alone
In the past year we have seen a leap in the capabilities of video generation models. It is no surprise that this could be considered the next frontier — videos encompass much of the world we live in and learning from this data could get us ever closer to more generalist agents. However, video generation only scratches the surface of what we can learn from such data. In this talk, I will discuss a few different works that further investigate how we can infer actions, rewards, policies, and even environments from videos alone.
11:10-11:50 Tali Dekel (Weizmann Institute of Science/Google DeepMind)
The Future of Video Generation: Beyond Data and Scale
11:50-12:10 Oral presentation
Diverse and aligned audio-to-video generation via text-to-video model adaptation
12:10-13:30 Lunch Break
13:30-14:30 Poster session
14:30-15:10 Sander Dieleman (Google DeepMind)
Wading through the noise: an intuitive look at diffusion models
Diffusion models come in many shapes and forms, and research papers tend to describe this class of models in a variety of different ways, which can be confusing. In this talk, we'll look at the intuition behind diffusion models, and why they work so well for audiovisual generative modelling in particular, with a focus on video generation.
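As a quick, concrete anchor for that intuition, here is a toy sketch of the core that most diffusion formulations share: corrupt the data with Gaussian noise, then train a network to predict the noise. The schedule value, random data, and placeholder denoiser below are invented for illustration and are not material from the talk.

import numpy as np

# Toy illustration only: noise the data at a chosen signal level, then score a
# (placeholder) denoiser with the epsilon-prediction objective.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4))            # a batch of "clean" data points
alpha_bar_t = 0.5                           # cumulative signal level at some timestep t
eps = rng.standard_normal(x0.shape)         # the noise the model should predict
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps  # noised sample

def denoiser(x_noisy, signal_level):
    # Placeholder for a neural network; here it simply predicts "no noise".
    return np.zeros_like(x_noisy)

# The epsilon-prediction training objective: mean squared error between the
# predicted noise and the noise that was actually added.
loss = np.mean((denoiser(x_t, alpha_bar_t) - eps) ** 2)
print(loss)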
15:10-15:30 Oral presentation
EgoSim: Egocentric Exploration in Virtual Worlds with Multi-modal Conditioning
15:30-15:50 Oral presentation
Alignment is All You Need: A Training-free Augmentation Strategy for Pose-guided Video Generation
15:50-16:30 William (Bill) Peebles (OpenAI)
Video Generation Models as World Simulators
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
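The abstract above mentions a transformer that operates on spacetime patches of video and image latent codes. As a rough illustration of what spacetime patchification can look like, here is a minimal sketch; the tensor layout, patch sizes, and function name are assumptions made for clarity, not the actual Sora implementation.

import numpy as np

def spacetime_patchify(latent, pt=2, ph=4, pw=4):
    # Split a video latent of shape (T, H, W, C) into a sequence of flattened
    # spacetime patches of size (pt, ph, pw), one token per patch.
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)        # (nT, nH, nW, pt, ph, pw, C)
    return x.reshape(-1, pt * ph * pw * C)      # (num_patches, patch_dim)

# Example: a 16-frame, 32x32 latent with 4 channels becomes a 512-token
# sequence of 128-dimensional patches, which a standard transformer can
# process regardless of the clip's duration or aspect ratio.
tokens = spacetime_patchify(np.random.default_rng(0).standard_normal((16, 32, 32, 4)))
print(tokens.shape)  # (512, 128)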
16:30-17:00 Boyi Li (Berkeley/NVIDIA)
Leveraging LLMs to Imagine Like Humans by Aligning Representations from Vision and Language
The machine learning community has embraced specialized models tailored to specific data domains. However, relying solely on a singular data type may constrain flexibility and generality, necessitating additional labeled data and limiting user interaction. Furthermore, existing content creation techniques often exhibit poor reasoning ability, even when trained with large datasets. To address these challenges, this talk will focus on building efficient intelligent systems that leverage language models to generate and edit images and videos, specifically in the areas of text-to-image and text-to-video generation. These findings effectively mitigate the limitations of current model setups and pave the way for multimodal representations that unify various signals within a single, comprehensive model.
17:00 Closing
Call for papers 📢
The past few years have seen the rapid development of Generative AI, with powerful foundation models demonstrating the ability to generate new, creative content in multiple modalities. Following breakthroughs in text and image generation, it is clear the next frontier lies in video. There has recently been remarkable progress in this domain, with state-of-the-art video generation models rapidly improving, generating visually engaging and aesthetically pleasing clips from a text prompt.
One challenging but compelling aspect unique to video generation is the variety of ways in which such generation can be controlled: from specifying the content of a video with text, to viewing a scene from different camera angles, or even directing the actions of characters within the video. The use cases of these models have also diversified, with works that extend generation to 3D scenes, use video models to learn policies for robotics tasks, or create interactive environments for gameplay.
Given the great variety of algorithmic approaches, the rapid progress, and the tremendous potential for applications, we believe now is the perfect time to engage the broader machine learning community in this exciting new research area. The first ICML workshop on Controllable Video Generation (CVG) seeks to bring together a variety of communities: from traditional computer vision, to safety and alignment, to those working on world models in a reinforcement learning or robotics setting.
We are accepting submissions in the following research areas:
Text-to-video models
Action-controllable video models
Style transfer and video editing
Camera pose control and 3D models
Methods to address safety, bias, ethical, and copyright considerations
With the following applications:
Video generation and editing.
Interactive experiences and games.
World models for agent training, robotics and autonomous driving.
Submit your work! 🧑💻
Organizers 👯
Michal Geyer
Weizmann Institute of Science
Jack Parker-Holder
Google DeepMind
UCL
Yuge (Jimmy) Shi
Google DeepMind
Trevor Darrell
UC Berkeley
Nando de Freitas
Google DeepMind
Antonio Torralba
MIT