The Efficient Online Continual Visual Instruction Tuning (CVIT) Challenge aims to accelerate the development of efficient, adaptive, and memory-constrained continual learning systems that can generalize across evolving visual instruction tasks. This challenge emphasizes real-world scenarios in which models must process sequences of visual instruction-following tasks without catastrophic forgetting, all while maintaining efficiency in both training and inference. We specifically designed the benchmark to fine-tune and evaluate multi-modal large language models (MLLMs), encouraging practical advances toward deployable, continually learning vision-language systems.
In total, 64 participants across 15 teams engaged in the first phase of the competition, making 203 submissions. We sincerely thank all participants for their enthusiastic engagement and contributions throughout the competition.
For the final evaluation phase, we re-evaluated the eight top-performing teams on an unseen upstream scenario to assess the generalization and adaptability of their proposed methods. To ensure a fair comparison of computational efficiency, we adjusted each approach's training-iteration hyperparameters so that their wall-clock runtimes were aligned.
The final ranking of the top five teams reflects a balanced consideration of task performance, memory efficiency, and runtime throughput, highlighting the most promising strategies for scalable and adaptive continual visual instruction tuning.
Videos of the workshop talks and the winning solutions: TBA
Junzhou Xu, Zijia An, Ruiqi Liu, Xi Zhang, Runjie Shao, Boyu Diao
Institute of Computing Technology, Chinese Academy of Sciences
Xiaojin Hua, Guobang Li
HUA Innovation High Tech (Hangzhou) Co., Ltd.
Cheng-Hao Tu
Independent researcher
Xu Li, Fan Lyu
Northeastern University; Institute of Automation, Chinese Academy of Sciences
Sungmin Lim, Jieun Park
AIMMO