Track 4: Text-Guided Video Editing
Introduction
Leveraging AI for video editing has the potential to unleash creativity for artists of all skill levels. The rapidly advancing field of Text-Guided Video Editing (TGVE) aims to make this possible. Recent works in this field include Tune-A-Video, Gen-2, and Dreamix.
In this competition track, we provide a standard set of videos and prompts. As a researcher, you will develop a model that takes a video and a prompt describing how to edit it, and produces an edited video. For instance, given a video of “people playing basketball in a gym,” your model might edit the video to show “dinosaurs playing basketball on the moon.”
With this competition, we aim to offer a place where researchers can rigorously compare video editing methods. After the competition ends, we hope the LOVEU-TGVE-2023 dataset can provide a standardized way of comparing AI video editing algorithms. The winning team will receive a $5,000 prize.
[News] We have created a community on Slack for feedback and discussion about the competition. The link is https://join.slack.com/t/slack-hws9995/shared_invite/zt-1vi06nixs-nFqbQZed8l_KMqg7UWP4oA.
Quick start guide
Download this dataset of videos and prompts
Start with this baseline code to automatically edit the video dataset
Try some new ideas!
Submit your generated videos using this form: LOVEU-TGVE Registration & Submission Form
Dates
May 1, 2023: The competition data and baseline code become available.
May 8, 2023: The leaderboard and submission instructions become available.
June 5, 2023: Deadline for submitting your generated videos.
June 18, 2023: LOVEU 2023 Workshop. Presentations by the winner and runner-up. The $5,000 prize will be paid to the winning team.
Evaluation method
To participate in the contest, you will submit the videos generated by your model. As you develop your model, you may want to visually evaluate your results and use automated metrics such as the CLIP score and PickScore to track your progress.
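As a minimal sketch of the frame-averaged CLIP score mentioned above: the per-frame image embeddings and the prompt's text embedding would come from a CLIP model (e.g. via Hugging Face transformers); the helper below only performs the cosine-similarity averaging, and its name and interface are our own illustration, not part of the official evaluation code.

```python
import numpy as np

def average_clip_similarity(frame_embeds, text_embed):
    """Mean cosine similarity between per-frame CLIP image embeddings
    and a single CLIP text embedding for the edit prompt.

    frame_embeds: (n_frames, d) array, assumed precomputed by a CLIP image encoder
    text_embed:   (d,) array from the matching CLIP text encoder
    """
    # L2-normalize so the dot product becomes cosine similarity
    f = frame_embeds / np.linalg.norm(frame_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    return float((f @ t).mean())
```

A higher value indicates the edited frames align better with the prompt; PickScore can be tracked in a similar frame-averaged fashion.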
After all submissions are uploaded, we will run a human evaluation of all submitted videos. Specifically, human labelers will compare each submitted video to the corresponding baseline video edited with the Tune-A-Video model. Labelers will evaluate videos on the following criteria:
Text alignment: Which video better matches the caption?
Structure: Which video better preserves the structure of the input video?
Quality: Aesthetically, which video is better?
We will choose a winner and a runner-up based on the human evaluation results.
Dataset
We conducted a survey of text-guided video editing papers, and we found the following patterns in how they evaluate their work:
Input: 10 to 100 videos, with ~3 editing prompts per video
Human evaluation to compare the generated videos to a baseline
We follow a similar protocol in our LOVEU-TGVE-2023 dataset. Our dataset consists of 76 videos, each with 4 editing prompts. All videos are Creative Commons licensed. Each video consists of either 32 or 128 frames, at a resolution of 480x480.
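After downloading the dataset, it can be useful to sanity-check that every clip has one of the expected frame counts. The sketch below assumes each video is stored as a directory of per-frame PNGs (e.g. `dataset/<video>/00000.png`); the actual layout may differ, so treat the path pattern as a placeholder.

```python
from pathlib import Path

def check_frame_counts(dataset_dir, allowed=(32, 128)):
    """Return {video_name: n_frames}, warning on unexpected counts.

    Assumes one subdirectory per video containing per-frame PNGs;
    adjust the glob pattern to match the real dataset layout.
    """
    report = {}
    for clip in sorted(Path(dataset_dir).iterdir()):
        if not clip.is_dir():
            continue
        n = len(list(clip.glob("*.png")))
        report[clip.name] = n
        if n not in allowed:
            print(f"warning: {clip.name} has {n} frames (expected one of {allowed})")
    return report
```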
If you find our dataset useful for your research, please cite our paper:
@article{wu2023cvpr,
title={CVPR 2023 Text Guided Video Editing Competition},
author={Wu, Jay Zhangjie and Li, Xiuyu and Gao, Difei and Dong, Zhen and Bai, Jinbin and Singh, Aishani and Xiang, Xiaoyu and Li, Youzeng and Huang, Zuwei and Sun, Yuanxi and others},
journal={arXiv preprint arXiv:2310.16003},
year={2023}
}
Baseline code
GitHub repo: https://github.com/showlab/loveu-tgve-2023
The baseline code is a version of Tune-A-Video, provided by the authors of the Tune-A-Video paper.
Baseline videos can be downloaded from here.
Rules
We strongly encourage participants to contribute to the open-source community by sharing their solutions. However, we acknowledge that certain circumstances, such as commercial constraints, may preclude the release of code. In such cases, participants may submit their results alone, although we emphasize the value of openness and collaboration in advancing the field of text-guided video editing.
Be sure to follow the instructions in the GitHub repo when saving your edited videos. This will help you get the right format and folder structure.
To submit your results, simply upload a .zip file and fill out the required information in the LOVEU-TGVE Registration & Submission Form.
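Packaging the results can be sketched as below. The required folder layout is defined by the instructions in the GitHub repo; the `results/` directory name here is only an assumed placeholder.

```python
import shutil
from pathlib import Path

def package_submission(results_dir="results", archive_name="submission"):
    """Zip a directory of generated videos for upload.

    results_dir:  directory holding edited videos, laid out per the repo's
                  instructions ("results" is a hypothetical name)
    archive_name: output path without the .zip extension
    Returns the path of the created .zip file.
    """
    results = Path(results_dir)
    if not results.is_dir():
        raise FileNotFoundError(f"expected edited videos under {results}/")
    # creates <archive_name>.zip containing the contents of results_dir
    return shutil.make_archive(archive_name, "zip", root_dir=results)
```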
If you have any questions, please feel free to reach out to us at loveu-tgve@googlegroups.com.
Report format
In your report, please explain clearly:
Your data, supervision, and any pre-trained models
Pertinent hyperparameters such as classifier-free guidance scale
If you used prompt engineering, please describe your approach
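Since the classifier-free guidance scale is one of the hyperparameters we ask you to report, here is a minimal sketch of how it is typically applied in diffusion sampling: the unconditional noise prediction is extrapolated toward the text-conditioned one. The function name is our own; your sampler's implementation may differ in detail.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance on noise predictions.

    guidance_scale=1.0 recovers the conditional prediction;
    larger values push the sample harder toward the text condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```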
The report can be brief (one page) or detailed (many pages), and should be submitted in PDF format.
FAQ
Q: Is prompt engineering allowed?
A: Yes! If you want to add “high quality, 4k” or things like that to the prompts, you are welcome to do that!
Q: Can we make multiple submissions?
A: There will be two types of evaluation: automated evaluation and human evaluation. For automated evaluation, you can submit as many times as you like. For human evaluation, which decides the competition's winners, we ask each team to limit itself to two submissions.
Q: Do you have any hints?
A: How about taking a video diffusion model such as VideoCrafter or Align your Latents or this one, and adapting it so it performs video editing?
People
The Text-Guided Video Editing Benchmark @ LOVEU 2023 was created by:
Jay Zhangjie Wu, Difei Gao, Jinbin Bai, Mike Shou (National University of Singapore)
Xiuyu Li, Zhen Dong, Aishani Singh, Kurt Keutzer (UC Berkeley)
Forrest Iandola (Meta)
Leaderboard
See this page for the leaderboard: https://huggingface.co/spaces/loveu-tgve/loveu-tgve-leaderboard
We update the leaderboard every 24 hours.