Text, Camera, Action!
Frontiers in Controllable Video Generation
Workshop @ ICML 2024, Vienna
About us 💃🕺
The first Controllable Video Generation workshop (CVG) will be held at ICML 2024 in Vienna, Austria.
The workshop focuses on exploring various modes of control in video generation: from specifying the content of a video with text 📄, to viewing a scene from different camera angles 📷, or even directing the actions of characters within the video 🏃.
Our aim is to showcase these diverse approaches and their applications, highlighting the latest advancements and exploring future directions in the field of video generation.
Speakers 🎙️
Andreas Blattmann
Stability AI
Tali Dekel
Weizmann Institute of Science/Google DeepMind
Sander Dieleman
Google DeepMind
Ashley Edwards
RunwayML
Boyi Li
Berkeley/NVIDIA
William (Bill) Peebles
OpenAI
Schedule
09:00-09:05 Introduction and Opening Remarks
09:05-09:40 Andreas Blattmann (Stability AI)
09:40-10:00 Oral presentation
Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices
10:00-10:30 Coffee break
10:30-11:10 Ashley Edwards (RunwayML)
Learning actions, policies, rewards, and environments from videos alone
In the past year we have seen a leap in the capabilities of video generation models. It is no surprise that this could be considered the next frontier — videos encompass much of the world we live in and learning from this data could get us ever closer to more generalist agents. However, video generation only scratches the surface of what we can learn from such data. In this talk, I will discuss a few different works that further investigate how we can infer actions, rewards, policies, and even environments from videos alone.
11:10-11:50 Tali Dekel (Weizmann Institute of Science/Google DeepMind)
The Future of Video Generation: Beyond Data and Scale
11:50-12:10 Oral presentation
Diverse and aligned audio-to-video generation via text-to-video model adaptation
12:10-13:30 Lunch Break
13:30-14:30 Poster session
14:30-15:10 Sander Dieleman (Google DeepMind)
Wading through the noise: an intuitive look at diffusion models
Diffusion models come in many shapes and forms, and research papers tend to describe this class of models in a variety of different ways, which can be confusing. In this talk, we'll look at the intuition behind diffusion models, and why they work so well for audiovisual generative modelling in particular, with a focus on video generation.
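As a quick, concrete anchor for that intuition, here is a toy sketch of the core that most diffusion formulations share: corrupt the data with Gaussian noise, then train a network to predict the noise. The schedule value, random data, and placeholder denoiser below are invented for illustration and are not material from the talk.

import numpy as np

# Toy illustration only: noise the data at a chosen signal level, then score a
# (placeholder) denoiser with the epsilon-prediction objective.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4))            # a batch of "clean" data points
alpha_bar_t = 0.5                           # cumulative signal level at some timestep t
eps = rng.standard_normal(x0.shape)         # the noise the model should predict
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps  # noised sample

def denoiser(x_noisy, signal_level):
    # Placeholder for a neural network; here it simply predicts "no noise".
    return np.zeros_like(x_noisy)

# The epsilon-prediction training objective: mean squared error between the
# predicted noise and the noise that was actually added.
loss = np.mean((denoiser(x_t, alpha_bar_t) - eps) ** 2)
print(loss)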
15:10-15:30 Oral presentation
EgoSim: Egocentric Exploration in Virtual Worlds with Multi-modal Conditioning
15:30-15:50 Oral presentation
Alignment is All You Need: A Training-free Augmentation Strategy for Pose-guided Video Generation
15:50-16:30 William (Bill) Peebles (OpenAI)
Video Generation Models as World Simulators
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
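The abstract above mentions a transformer that operates on spacetime patches of video and image latent codes. As a rough illustration of what spacetime patchification can look like, here is a minimal sketch; the tensor layout, patch sizes, and function name are assumptions made for clarity, not the actual Sora implementation.

import numpy as np

def spacetime_patchify(latent, pt=2, ph=4, pw=4):
    # Split a video latent of shape (T, H, W, C) into a sequence of flattened
    # spacetime patches of size (pt, ph, pw), one token per patch.
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)        # (nT, nH, nW, pt, ph, pw, C)
    return x.reshape(-1, pt * ph * pw * C)      # (num_patches, patch_dim)

# Example: a 16-frame, 32x32 latent with 4 channels becomes a 512-token
# sequence of 128-dimensional patches, which a standard transformer can
# process regardless of the clip's duration or aspect ratio.
tokens = spacetime_patchify(np.random.default_rng(0).standard_normal((16, 32, 32, 4)))
print(tokens.shape)  # (512, 128)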
16:30-17:00 Boyi Li (Berkeley/NVIDIA)
Leveraging LLMs to Imagine Like Humans by Aligning Representations from Vision and Language
The machine learning community has embraced specialized models tailored to specific data domains. However, relying solely on a singular data type may constrain flexibility and generality, necessitating additional labeled data and limiting user interaction. Furthermore, existing content creation techniques often exhibit poor reasoning ability, even when trained with large datasets. To address these challenges, this talk will focus on building efficient intelligent systems that leverage language models to generate and edit images and videos, specifically in the areas of text-to-image and text-to-video generation. These findings effectively mitigate the limitations of current model setups and pave the way for multimodal representations that unify various signals within a single, comprehensive model.
17:00 Closing
Call for papers 📢
The past few years have seen the rapid development of Generative AI, with powerful foundation models demonstrating the ability to generate new, creative content in multiple modalities. Following breakthroughs in text and image generation, it is clear the next frontier lies in video. There has recently been remarkable progress in this domain, with state-of-the-art video generation models rapidly improving, generating visually engaging and aesthetically pleasing clips from a text prompt.
One challenging but compelling aspect unique to video generation is the variety of ways in which such generation can be controlled: from specifying the content of a video with text, to viewing a scene from different camera angles, or even directing the actions of characters within the video. The use cases of these models have also diversified, with works that extend generation to 3D scenes, use video models to learn policies for robotics tasks, or create interactive environments for gameplay.
Given the great variety of algorithmic approaches, the rapid progress, and the tremendous potential for applications, we believe now is the perfect time to engage the broader machine learning community in this exciting new research area. The first ICML workshop on Controllable Video Generation (CVG) seeks to bring together a variety of communities: from traditional computer vision, to safety and alignment, to those working on world models in a reinforcement learning or robotics setting.
We are accepting submissions in the following research areas:
Text-to-video models
Action-controllable video models
Style transfer and video editing
Camera pose control and 3D models
Methods to address safety, bias, ethical, and copyright considerations
With the following applications:
Video generation and editing.
Interactive experiences and games.
World models for agent training, robotics and autonomous driving.
Submit your work! 🧑💻
Organizers 👯
Michal Geyer
Weizmann Institute of Science
Jack Parker-Holder
Google DeepMind
UCL
Yuge (Jimmy) Shi
Google DeepMind
Trevor Darrell
UC Berkeley
Nando de Freitas
Google DeepMind
Antonio Torralba
MIT