Track 2A: Text-Guided Video Editing
Introduction
Leveraging AI for video editing has the potential to unleash creativity for artists across all skill levels. The rapidly advancing field of Text-Guided Video Editing (TGVE) aims to make this possible: editing a video from nothing more than a natural-language description of the desired change.
In this competition track, we provide a standard set of videos and prompts. As a researcher, you will develop a model that takes a video and a prompt for how to edit it, and your model will produce an edited video. For instance, you might be given a video of “people playing basketball in a gym,” and your model will edit the video to “dinosaurs playing basketball on the moon.”
With this competition, we aim to offer a place where researchers can rigorously compare video editing methods. After the competition ends, we hope the LOVEU-TGVE-2024 dataset can provide a standardized way of comparing AI video editing algorithms. The winning team will receive a $2000 USD prize.
Quick start guide
Download this dataset of videos and prompts
Start with existing video editing baselines to automatically edit the video dataset
Try some new ideas!
Submit your generated videos using this form: LOVEU-TGVE Registration & Submission Form
Dates
May 15, 2024: The competition data and baseline code become available.
May 22, 2024: The leaderboard and submission instructions become available.
June 8, 2024: Deadline for submitting your generated videos.
June 17, 2024: LOVEU 2024 Workshop. Presentations by the winner and runner-up. The $2000 prize will be paid to the winning team.
Evaluation method
To participate in the contest, you will submit the videos generated by your model. As you develop your model, you may want to visually evaluate your results and use automated metrics such as the CLIP score and PickScore to track your progress.
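As a rough illustration of the automated metrics mentioned above, the sketch below computes a frame-averaged CLIP score from precomputed embeddings. It assumes you have already extracted image embeddings for each video frame and a text embedding for the edit caption with a CLIP model (e.g. via the open_clip library); the arrays shown are toy stand-ins, not real embeddings.

```python
import numpy as np

def clip_score(frame_embs: np.ndarray, text_emb: np.ndarray) -> float:
    """Frame-averaged CLIP score: 100 * max(cosine similarity, 0),
    averaged over all frames. Embeddings are assumed to come from a
    CLIP model; here they are plain arrays for illustration."""
    # L2-normalize so the dot product equals cosine similarity.
    frame_embs = frame_embs / np.linalg.norm(frame_embs, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    sims = frame_embs @ text_emb  # one cosine similarity per frame
    return float(100.0 * np.maximum(sims, 0.0).mean())

# Toy example: three "frames" with 4-d embeddings and one caption embedding.
frames = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0, 0.0]])
caption = np.array([1.0, 0.0, 0.0, 0.0])
score = clip_score(frames, caption)
```

PickScore works differently (it is a learned preference model rather than a plain similarity), so treat this only as a sketch of the CLIP-score side of tracking your progress.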
After all submissions are uploaded, we will run a human evaluation of all submitted videos. Labelers will evaluate videos on the following criteria:
Text alignment: Which video better matches the edit caption?
Structure: Which video better preserves the structure of the input video?
Quality: Aesthetically, which video is better?
We will choose a winner and a runner-up based on the human evaluation results.
Dataset
Our LOVEU-TGVE-2024 dataset consists of 200 videos spanning 5 categories (Animal, Food, Scenery, Sport Activity, and Vehicle).
Each video has 6 editing prompts:
Object Insertion: Adds new objects to the video.
Object Removal: Removes existing objects from the video.
Object Change: Replaces an object while preserving its motion.
Scene Change: Alters the setting of the video.
Motion Change: Modifies the motion of objects or the camera.
Stylization: Applies a specific style to the video.
The videos are sourced from the Panda-70M dataset. They vary in duration from 2s to 48s.
Rules
We strongly encourage participants to contribute to the open-source community by sharing their solutions. However, we acknowledge that certain circumstances, such as commercial constraints, may preclude the release of code. In such cases, participants may submit their results alone, although we emphasize the value of openness and collaboration in advancing the field of text-guided video editing.
Be sure to follow the instructions (TBA) when saving your edited videos, to ensure the correct format and folder structure.
To submit your results, simply upload a .zip file and fill out the required information in the LOVEU-TGVE Registration & Submission Form.
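Since the official saving instructions are still TBA, the folder layout below is purely hypothetical; the only fixed requirement from this page is that the upload be a single .zip file. A minimal packaging sketch:

```python
import shutil
from pathlib import Path

# HYPOTHETICAL layout -- the official folder structure is TBA, so
# adjust these directory names once the saving instructions appear.
submission_dir = Path("submission")
(submission_dir / "animal").mkdir(parents=True, exist_ok=True)
(submission_dir / "food").mkdir(parents=True, exist_ok=True)
# ... save your edited videos into the appropriate folders here ...

# Package the whole directory as submission.zip for the form upload.
archive = shutil.make_archive("submission", "zip", root_dir=submission_dir)
print(archive)  # path to the created .zip file
```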
If you have any questions, please feel free to reach out to us at loveu-tgve@googlegroups.com.
Report format
In your report, please explain clearly:
Your data, supervision, and any pre-trained models
Pertinent hyperparameters such as classifier-free guidance scale
If you used prompt engineering, please describe your approach
The report can be simple (one page) or detailed (many pages), and should be submitted as a PDF.
FAQ
Q: Is prompt engineering allowed?
A: Yes! If you want to add “high quality, 4k” or similar phrases to the prompts, you are welcome to do so!
Q: Can we make multiple submissions?
A: There will be two types of evaluation: automated evaluation and human evaluation. For automated evaluation, you can submit many times. For human evaluation, which decides the competition's winners, we ask each team to limit itself to two submissions.
Q: Do you have any hints?
A: There are many open-source video editing methods you may start with, e.g., Tune-A-Video, MotionDirector, VideoSwap, Rerender A Video, Ground-A-Video, etc.
Organizers
Joe Fu
UC Berkeley
Aishani Singh
UC Berkeley
Vijay Anand Raghava Kanakagiri
Texas A&M University