The Efficient Online Continual Visual Instruction Tuning (CVIT) Challenge aims to accelerate the development of efficient, adaptive, and memory-constrained continual learning systems that can generalize across evolving visual instruction tasks. This challenge emphasizes real-world scenarios in which models must process sequences of visual instruction-following tasks without catastrophic forgetting, all while maintaining efficiency in both training and inference. We specifically designed the benchmark to fine-tune and evaluate multi-modal large language models (MLLMs), encouraging practical advances toward deployable, continually learning vision-language systems.
In total, 64 participants across 15 teams engaged in the first phase of the competition, making 203 submissions. We sincerely thank all participants for their enthusiastic engagement and contributions throughout the competition.
For the final evaluation phase, we re-evaluated the eight top-performing teams on an unseen upstream scenario to assess the generalization and adaptability of their proposed methods. To ensure a fair comparison of computational efficiency, we adjusted each approach's training-iteration hyperparameters so that their wall-clock runtimes were aligned.
The final ranking of the top five teams reflects a balanced consideration of task performance, memory efficiency, and runtime throughput, highlighting the most promising strategies for scalable and adaptive continual visual instruction tuning.
Videos of the workshop talks and the winning solutions: TBA
Junzhou Xu, Zijia An, Ruiqi Liu, Xi Zhang, Runjie Shao, Boyu Diao
Institute of Computing Technology, Chinese Academy of Sciences
Xiaojin Hua, Guobang Li
HUA Innovation High Tech (Hangzhou) Co., Ltd.
Cheng-Hao Tu
Independent researcher
Xu Li, Fan Lyu
Northeastern University; Institute of Automation, Chinese Academy of Sciences
Sungmin Lim, Jieun Park
AIMMO