To promote the practical utility and real-world deployment of continual instruction-tuned vision-language models, we are organizing the Efficient Online Continual Visual Instruction Tuning (CVIT) Challenge as part of the ICCV 2025 6th CLVISION Workshop.
The primary goal is to benchmark and accelerate the development of efficient, adaptive, and memory-constrained continual learning systems that can generalize across evolving visual instruction tasks. This challenge focuses on real-world scenarios where models are required to process sequences of visual instruction-following tasks without catastrophic forgetting, while remaining efficient in training and inference.
Join us in exploring the frontier of continual learning in multimodal AI systems!
Strategies can be submitted at the Codabench competition: https://www.codabench.org/competitions/9421/
The challenge DevKit can be accessed on GitHub: https://github.com/clvision2025-challenge/clvision-challenge-2025
The pre-selection phase will run until Aug 31st 2025.
<2025-Jul-8th>: A corrected version of 'create_submission_file.py' has been uploaded to the DevKit GitHub repository. Please use the latest version to create your submission files.
The goal of this challenge is to evaluate and advance continual visual instruction tuning methods that effectively adapt to new visual-language tasks while retaining previously learned knowledge. Each scenario in the challenge consists of four sequentially introduced upstream tasks, which models incrementally learn. After continual learning on these upstream tasks, participants must quickly adapt their final models to a previously unseen downstream task by fine-tuning for only one epoch. Participants are expected to design efficient and memory-constrained systems capable of continual tuning. This challenge tests models' abilities to continually learn from visual instruction streams without forgetting, while efficiently generalizing and rapidly adapting to newly introduced tasks.
Participants are asked to develop strategies that, after training incrementally on all four upstream tasks, achieve high accuracy on both the evaluation sets of these upstream tasks and the evaluation set of the newly introduced downstream task. The submitted approach should utilize the same continual learning and quick-adaptation strategy across all provided scenarios, ensuring consistency and robustness of the method under diverse visual-language task conditions. Participants must carefully balance model effectiveness with efficiency and memory usage, ensuring their methods can be feasibly deployed in resource-constrained environments.
Participants are challenged to develop new strategies using the provided DevKit. The challenge is organized in two phases:
Pre-selection: participants run their strategies and generate predictions on their own machines. These results are submitted to the Codabench platform, where the score is computed and posted to the leaderboard.
Final evaluation: the five strategies with the highest average test accuracy will be evaluated on novel scenarios similar to the ones provided in the competition. For this, participants will be asked to share the code of their strategies with the organizers so they can verify that no foul play was involved. These variations are intended to test the robustness of the submitted strategies. The top strategy and the final ranking will then be announced.
The top strategies may be asked to submit a report and prepare a (short) presentation to be given during the workshop. Reports may also be requested from teams that submitted particularly interesting solutions, even among the non-winning ones.
Participants must implement their strategies on top of the DevKit. Modifying the data loading process or the competition-related modules is not permitted.
For the pre-selection phase, participants must validate their strategies on two upstream scenarios built from a series of vision-language tasks:
Scenario 1 - Intra-dataset Continual Learning: contains 4 upstream tasks split from a single vision-language task, where each task contains 4,000 train samples and 1,000 test samples.
Scenario 2 - Cross-task Continual Learning: contains 4 upstream tasks from multiple vision-language tasks, where each task contains 40,000 train samples and 5,000 test samples.
For the final evaluation, the top-performing participants' strategies will be evaluated on an additional upstream scenario, Scenario 3, which contains 6 upstream tasks from multiple vision-language tasks with an imbalanced number of samples per task.
The downstream tasks are the same across all scenarios. 4 different downstream tasks are provided, where each task contains 500 train samples and 2,000 test samples. Fine-tuning on each downstream task must be run for exactly one epoch.
Upstream task test sets must be evaluated on the final model obtained after continual learning on all four upstream tasks, which is initialized from the base model (LLaVA v1.5).
Downstream task test sets must be evaluated on models adapted from the final upstream models by fine-tuning them separately for each downstream task, for only one epoch. Therefore, participants need to create and evaluate four adapted models, each specifically fine-tuned for one downstream task.
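To make this protocol concrete, below is a minimal sketch of the training and evaluation flow for one scenario. All function and field names here (train_on_task, evaluate, the "test" key) are hypothetical placeholders for illustration, not the DevKit API; the sketch only shows which model is evaluated on which test sets.

```python
from copy import deepcopy

def train_on_task(model, task, one_epoch=True):
    """Placeholder for a participant's training routine (not a DevKit function)."""
    return model

def evaluate(model, test_set):
    """Placeholder for the evaluation routine; returns an accuracy in [0, 100]."""
    return 0.0

def run_scenario(base_model, upstream_tasks, downstream_tasks):
    # 1) Continual learning: the upstream tasks arrive sequentially,
    #    starting from the base model (LLaVA v1.5).
    model = deepcopy(base_model)
    for task in upstream_tasks:
        model = train_on_task(model, task)

    # 2) Upstream evaluation: the final model is scored on every upstream test set.
    upstream_acc = [evaluate(model, task["test"]) for task in upstream_tasks]

    # 3) Downstream evaluation: one adapted model per downstream task, each
    #    fine-tuned from the final upstream model for exactly one epoch.
    downstream_acc = []
    for task in downstream_tasks:
        adapted = train_on_task(deepcopy(model), task, one_epoch=True)
        downstream_acc.append(evaluate(adapted, task["test"]))

    return upstream_acc, downstream_acc
```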
The leaderboard ranking is based solely on performance in Scenarios 1 and 2.
Submissions are limited to 2 per day until 10 days before the deadline, after which the limit increases to 3 per day. Using multiple accounts on Codabench to increase the number of submissions is prohibited.
Competition Rules
Model: all participants must use the LLaVA-v1.5 7B model provided in the DevKit as the initial model for continual fine-tuning on the upstream tasks. However, there are no restrictions on fine-tuning methods (e.g., LoRA or QLoRA), as long as they do not exceed the maximum GPU memory allowed for the competition.
Continual Learning: this challenge adopts an online continual learning setup. Unlike offline continual learning, training is strictly limited to a single-epoch equivalent, reflecting realistic deployment scenarios. Participants are allowed to modify the number of online iteration steps for training within each task (a minimal sketch of such a single-pass loop is given after these rules). However, keep in mind that the final evaluation will explicitly consider wall-clock time for a fair comparison, balancing computational cost against model effectiveness. See 'Final Ranking Decision' below for more detail.
Submission: for each submission, the predictions for the two scenarios must come from the same strategy. The strategy must be able to solve all scenarios without access to a scenario ID, since it will also have to work on the novel scenarios in the final phase. No data from external sources may be used.
Replay Buffer: we allow a memory-infinite setup, assuming all samples (e.g., images, text, logits, features) can be stored in the replay buffer, reflecting that memory cost is not a bottleneck in real-world scenarios. However, at most 2 GB of model parameters may be stored in the replay buffer.
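As an illustration of the online setup and the replay-buffer rule above, here is a minimal sketch of a single-pass training loop with replay. The ReplayBuffer class, the loss_fn callable, and the PyTorch-style model/optimizer interface are assumptions for illustration only; the actual DevKit interfaces may differ, and any parameter-efficient method (e.g., LoRA) can sit behind the model object.

```python
import random

class ReplayBuffer:
    """Unbounded buffer for past samples; stored model parameters are capped at 2 GB."""
    def __init__(self):
        self.samples = []

    def add(self, batch):
        self.samples.extend(batch)

    def sample(self, k):
        return random.sample(self.samples, min(k, len(self.samples)))

def online_train_task(model, optimizer, loss_fn, stream_loader, buffer, replay_k=4):
    """Single-pass (single-epoch-equivalent) training on one upstream task."""
    model.train()
    for batch in stream_loader:               # each incoming sample is seen once
        replay = buffer.sample(replay_k)      # mix in stored samples to reduce forgetting
        optimizer.zero_grad()
        loss = loss_fn(model, list(batch) + replay)
        loss.backward()
        optimizer.step()
        buffer.add(batch)                     # store the new samples for future replay
```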
Hardware Limitations
Number of GPUs: Participants are allowed to use only 1 GPU for training.
Max GPU Memory Usage: 48000 MB (NVIDIA RTX A6000)
Evaluation of the submitted strategies during the final phase will be done on a machine with an NVIDIA RTX A6000 (48 GB) GPU, 256 GB of RAM, and 32 CPU cores. Any strategy that fails to run on this system will be regarded as invalid.
Submissions are evaluated based on A_last accuracy, computed separately for upstream and downstream performance.
For each of the four submission files, we extract the predicted answer and compare it against ground truth using exact match.
The evaluation process involves the following steps:
From the submitted JSON file,
Parse the input field to extract valid answer choices (e.g., Choice list: [A, B, C, D])
Infer the selected answer from the sentence field
Compare the selected answer with the ground truth answer
Accuracy is calculated as the percentage of correctly predicted answers for each JSON file. Then, the following scores are computed:
Upstream A_last: Average accuracy over scenario1_upstream and scenario2_upstream (ROUGE-L score for Scenario 2)
Downstream A_last: Average accuracy over scenario1_downstream and scenario2_downstream
Average A_last: Mean of the upstream and downstream scores
These three values are displayed on the leaderboard.
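Below is a minimal sketch of this scoring logic, assuming each JSON entry carries input, sentence, and ground-truth fields; the exact field names and the official Codabench scorer may differ, and Scenario 2 upstream is scored with ROUGE-L rather than exact match.

```python
import json
import re

def extract_choices(input_text):
    """Pull valid answer choices out of a pattern like 'Choice list: [A, B, C, D]'."""
    m = re.search(r"Choice list:\s*\[([^\]]*)\]", input_text)
    return [c.strip() for c in m.group(1).split(",")] if m else []

def infer_answer(sentence, choices):
    """Pick the first listed choice that appears in the model's output sentence."""
    for choice in choices:
        if choice and choice in sentence:
            return choice
    return None

def file_accuracy(path):
    """Exact-match accuracy (%) over one submission file ('gt' is an assumed field name)."""
    with open(path) as f:
        preds = json.load(f)                      # one entry per test sample
    correct = 0
    for entry in preds:
        choices = extract_choices(entry["input"])
        predicted = infer_answer(entry["sentence"], choices)
        correct += int(predicted == entry["gt"])  # exact match against ground truth
    return 100.0 * correct / max(len(preds), 1)

# A_last aggregation over the four submission files:
# upstream   = mean(scenario1_upstream, scenario2_upstream)
# downstream = mean(scenario1_downstream, scenario2_downstream)
# average    = (upstream + downstream) / 2
```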
For more information on the format of the prediction submission, check the details in the DevKit and the leaderboard description.
For the final evaluation, the top-performing participants' strategies will be evaluated on an additional upstream scenario.
The final ranking of the top-5 teams will be determined through a fair computational-efficiency comparison. Selected teams will be required to submit their code, and the organizers will generate wall-clock time (training + inference) vs. accuracy (Average A_last) plots by varying the training-iteration hyperparameter. The final ranking will be based on comparing the accuracy of all submitted models at an equivalent wall-clock time, ensuring a fair evaluation that considers both computational cost and model performance.
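As an illustration only (the organizers' actual procedure may differ), comparing strategies at an equal wall-clock budget could look like the sketch below, where each team's (wall-clock time, Average A_last) points come from varying the training-iteration hyperparameter and the team_a/team_b numbers are made up.

```python
def accuracy_at_time(points, t_ref):
    """Linearly interpolate Average A_last at a reference wall-clock time.
    `points` is a list of (wall_clock_seconds, average_a_last) pairs."""
    points = sorted(points)
    for (t0, a0), (t1, a1) in zip(points, points[1:]):
        if t0 <= t_ref <= t1:
            w = (t_ref - t0) / (t1 - t0) if t1 > t0 else 0.0
            return a0 + w * (a1 - a0)
    return None  # t_ref is outside the measured range

# Example: rank two hypothetical teams at the same 2-hour budget.
team_a = [(3600, 41.0), (7200, 45.5), (10800, 47.0)]
team_b = [(3600, 43.0), (7200, 44.0), (10800, 44.5)]
print(accuracy_at_time(team_a, 7200), accuracy_at_time(team_b, 7200))
```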
30th June 2025: Beginning of the competition, start of the pre-selection phase.
The challenge scenario config files are released together with the DevKit. The Codabench leaderboard starts accepting submissions!
31st Aug 2025: End of the pre-selection phase, start of final evaluation phase.
The submission portal will stop accepting entries. The highest-ranking participants will be asked to send their solutions and reports to the challenge organizers for the final evaluation.
19th Sep 2025: End of final evaluation phase.
The organizers will evaluate the strategies and reports from the highest-ranking participants on the novel scenarios and announce a final ranking. Participants with valid strategies will be asked to present them during the workshop.
19th Oct 2025: Workshop day.
Winners will present their solutions!
To participate in the challenge, use the link: https://www.codabench.org/competitions/9421/
FAQs will be posted and updated here.
For further questions, contact