All data are provided for research purposes only under the CC BY-NC license.
We appreciate your understanding and cooperation.
In this challenge, we introduce a new video caption dataset for Intention-Oriented Controllable Video Captioning, based on the LaSOT dataset.
The dataset consists of 70 classes and is structured as follows:
IntentVC/
├── airplane/
│ ├── airplane-1/
│ │ ├── airplane-1.mp4
│ │ ├── object_bboxes.txt
│ ├── ...
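A minimal sketch of how the layout above could be traversed. It assumes the dataset root has been extracted locally and that every sequence directory follows the `<class>/<class>-<n>/` pattern shown in the tree; the function name `iter_sequences` is hypothetical and not part of any official toolkit.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_sequences(root: Path) -> Iterator[Tuple[str, str, Path, Path]]:
    """Yield (class name, sequence name, video path, bbox path) per sequence.

    Assumes the layout root/<class>/<sequence>/<sequence>.mp4 with an
    object_bboxes.txt file next to each video.
    """
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for seq_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            video = seq_dir / f"{seq_dir.name}.mp4"
            bboxes = seq_dir / "object_bboxes.txt"
            # Skip incomplete sequences instead of failing on them.
            if video.is_file() and bboxes.is_file():
                yield class_dir.name, seq_dir.name, video, bboxes
```

For example, `for cls, seq, video, bboxes in iter_sequences(Path("IntentVC")): ...` would visit each sequence once, in lexicographic order of directory names.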
All videos have a frame rate of 1 FPS.
Each video is accompanied by an object_bboxes.txt file that stores the bounding boxes of its target object.
If an object moves out of view, its bounding box is recorded as 0,0,0,0.
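The out-of-view convention can be handled when reading the annotations. The sketch below assumes each line of object_bboxes.txt holds four comma-separated numbers, one line per frame; the meaning of the four values is not stated here, so the code leaves them uninterpreted. The helper name `parse_bboxes` is hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]


def parse_bboxes(text: str) -> List[Optional[Box]]:
    """Parse the contents of an object_bboxes.txt file.

    Returns one entry per non-empty line; None marks a line where the
    object is out of view (recorded as 0,0,0,0).
    """
    boxes: List[Optional[Box]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        values = tuple(int(float(v)) for v in line.split(","))
        if len(values) != 4:
            raise ValueError(f"expected 4 comma-separated values, got {line!r}")
        boxes.append(None if values == (0, 0, 0, 0) else values)
    return boxes
```

Mapping the sentinel to `None` keeps downstream code from treating an empty box as a real detection.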
The captions for videos are saved in a .json file.
🛬 An example of an airplane video from the dataset used in the challenge.
Panels: original video; video with the focused target object (highlighted with green bounding boxes).
Example ground-truth caption:
{
  "airplane-1": [
    "small airplane approaches runway surrounded by hills and trees on a clear day",
    "white airplane approaches runway and lands safely on the tarmac with mountains in the background",
    "a white airplane descends onto the runway with hills and trees in the background",
    "a small white airplane descends towards a runway amid the mountainous terrain",
    "white airplane descends towards runway amidst mountainous landscape and onlookers nearby"
  ]
}
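A caption file in this shape can be read with the standard `json` module. The sketch below assumes the file maps each video name to a list of caption strings, as in the example above, and that the class name is the part of the video name before the last hyphen (consistent with the directory layout). Both helper names are hypothetical.

```python
import json
from typing import Dict, List


def load_captions(json_text: str) -> Dict[str, List[str]]:
    """Parse caption JSON and check it maps video names to lists of strings."""
    data = json.loads(json_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for video_id, captions in data.items():
        if not isinstance(captions, list) or not all(
            isinstance(c, str) for c in captions
        ):
            raise ValueError(f"captions for {video_id!r} must be a list of strings")
    return data


def class_of(video_id: str) -> str:
    """Return the class part of a video name, e.g. 'airplane-1' -> 'airplane'."""
    return video_id.rsplit("-", 1)[0]
```

With a file on disk, `load_captions(Path("captions.json").read_text(encoding="utf-8"))` would return the mapping; the file name here is only a placeholder.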
If you use this dataset, please cite our challenge and the original LaSOT paper.
@inproceedings{intentvc2025,
title = {IntentVC 2025: The ACM Multimedia Grand Challenge on Intention-Oriented Controllable Video Captioning},
author = {Komamizu, Takahiro and Kastner, Marc A. and Kawanishi, Yasutomo and Nguyen, Trung Thanh and Chen, Junan},
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
year = {2025},
pages = {1--2},
}
@inproceedings{fan2019lasot,
title = {LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking},
author = {Fan, Heng and Lin, Liting and Yang, Fan and Chu, Peng and Deng, Ge and Yu, Sijia and Bai, Hexin and Xu, Yong and Liao, Chunyuan and Ling, Haibin},
booktitle = {Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2019},
pages = {5374--5383},
}