Track 4: Text-Guided Video Editing


Introduction

Leveraging AI for video editing has the potential to unleash creativity for artists across all skill levels. The rapidly advancing field of Text-Guided Video Editing (TGVE) aims to make this possible. Recent works in this field include Tune-A-Video, Gen-2, and Dreamix.


In this competition track, we provide a standard set of videos and prompts. As a researcher, you will develop a model that takes a video and a prompt describing how to edit it, and produces the edited video. For instance, given a video of “people playing basketball in a gym,” your model might edit it into “dinosaurs playing basketball on the moon.”
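For concreteness, below is a minimal Python sketch of the interface such a method might expose. The function name and the frame representation are illustrative assumptions, not something the competition prescribes.

from PIL import Image

def edit_video(frames: list[Image.Image], edit_prompt: str) -> list[Image.Image]:
    """Given the source frames and an editing prompt such as
    "dinosaurs playing basketball on the moon", return edited frames
    of the same length and resolution."""
    raise NotImplementedError  # replace with your editing model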


With this competition, we aim to offer a place where researchers can rigorously compare video editing methods. After the competition ends, we hope the LOVEU-TGVE-2023 dataset can provide a standardized way of comparing AI video editing algorithms. The winning team will receive a $5000 USD prize.

[News] We have created a community on Slack for feedback and discussion about the competition. The link is https://join.slack.com/t/slack-hws9995/shared_invite/zt-1vi06nixs-nFqbQZed8l_KMqg7UWP4oA.

Quick start guide

Dates

Evaluation method

To participate in the contest, you will submit the videos generated by your model. As you develop your model, you may want to visually evaluate your results and use automated metrics such as the CLIP score and PickScore to track your progress.
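As an illustration, here is a minimal sketch of a frame-averaged CLIP score between an edited video and its editing prompt, using the Hugging Face transformers CLIP implementation. The checkpoint choice and the frame format are assumptions, and this is not the official evaluation script; PickScore follows a similar image-text scoring pattern with its own model.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-large-patch14"  # assumed checkpoint, not the official one
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

@torch.no_grad()
def clip_score(frames: list[Image.Image], prompt: str) -> float:
    """Average cosine similarity between the editing prompt and each frame."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # Normalize the projected embeddings, then compare every frame to the prompt.
    image_embeds = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    sims = image_embeds @ text_embeds.T  # shape (num_frames, 1)
    return sims.mean().item()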


After all submissions are uploaded, we will run a human evaluation of all submitted videos. Specifically, human labelers will compare each submitted video to the corresponding baseline video edited with the Tune-A-Video model. Labelers will evaluate videos on the following criteria:


We will choose a winner and a runner-up based on the human evaluation results.

Dataset

We conducted a survey of text-guided video editing papers and found the following patterns in how they evaluate their work:


We follow a similar protocol in our LOVEU-TGVE-2023 dataset. The dataset consists of 76 videos, each with 4 editing prompts. All videos are Creative Commons licensed. Each video consists of either 32 or 128 frames at a resolution of 480x480.
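To give a sense of the evaluation loop, the following sketch iterates over videos and prompts. The directory layout, file names, and prompt-file format are assumptions for illustration only, not the official release format.

from pathlib import Path
from PIL import Image

DATA_ROOT = Path("loveu-tgve-2023")  # hypothetical local path to the dataset

def load_frames(video_dir: Path) -> list[Image.Image]:
    """Load the 32 or 128 frames of one 480x480 video, sorted by file name."""
    return [Image.open(p).convert("RGB") for p in sorted(video_dir.glob("*.png"))]

def load_prompts(video_dir: Path) -> list[str]:
    """Read the four editing prompts, one per line (assumed format)."""
    return (video_dir / "prompts.txt").read_text().splitlines()

for video_dir in sorted(d for d in DATA_ROOT.iterdir() if d.is_dir()):
    frames = load_frames(video_dir)
    for prompt in load_prompts(video_dir):
        pass  # run your editing model on (frames, prompt) and save the result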

If you find our dataset useful for your research, please cite our paper:

@article{wu2023cvpr,
  title={CVPR 2023 Text Guided Video Editing Competition},
  author={Wu, Jay Zhangjie and Li, Xiuyu and Gao, Difei and Dong, Zhen and Bai, Jinbin and Singh, Aishani and Xiang, Xiaoyu and Li, Youzeng and Huang, Zuwei and Sun, Yuanxi and others},
  journal={arXiv preprint arXiv:2310.16003},
  year={2023}
}

Baseline code

GitHub repo: https://github.com/showlab/loveu-tgve-2023

The baseline code is a version of Tune-A-Video, provided by the authors of the Tune-A-Video paper.

Baseline videos can be downloaded from here.

Rules

Report format

In your report, please explain clearly:

The report can be simple (one page) or detailed (many pages), and it should be submitted as a PDF.

FAQ

People

The Text-Guided Video Editing Benchmark @ LOVEU 2023 was created by:

Leaderboard

See this page for the leaderboard: https://huggingface.co/spaces/loveu-tgve/loveu-tgve-leaderboard


We update the leaderboard every 24 hours.