Challenge
Challenge Overview
In real-world machine learning applications, data is rarely cheap to acquire, store, and label. It is therefore crucial to develop strategies flexible enough to learn from streams of experiences without forgetting what has been learned previously. Additionally, contextual unlabelled data can be exploited to integrate extra information into the model.
This challenge aims to explore techniques that combine these two fundamental aspects of data efficiency: continual learning and unlabelled data usage.
Strategies can be submitted at the CodaLab competition: https://codalab.lisn.upsaclay.fr/competitions/17780
The challenge DevKit can be accessed on GitHub: https://github.com/ContinualAI/clvision-challenge-2024
The pre-selection phase will run until May 13th 2024.
A prize of 1,000 dollars is sponsored by Apple for the top-ranking participants.
IMPORTANT UPDATE ❗ -- The competition has been extended by one week, until 13th May!
Due to discrepancies in the data configuration files provided in the repository, the pickle configurations have been updated to reflect the scenarios depicted on the challenge website. We kindly ask you to pull the latest commit with the configurations, retrain your models, and submit your best strategies. The CodaLab leaderboard has been reset, and all participants have regained the original 50 attempts.
We have also updated the scoring function to compensate for some submission mismatches and to make sure it prioritizes the final accuracy; this is now handled within the submission evaluation. All results have been re-evaluated, but this has no effect on the total number of submissions.
We apologize for any inconvenience this may have caused.
Challenge Goals
The goal of this challenge is to tackle the Class-Incremental with Repetition (CIR) problem by exploiting unlabelled data.
CIR encompasses a variety of scenarios with two key characteristics:
previously observed classes can re-appear in a new experience with arbitrary repetition patterns, and
not all classes are available in every experience.
In this competition, each scenario is divided into 50 experiences. Each experience provides a training session with access to a set of labelled and unlabelled samples. During this time we can train, update or adapt our model using the labelled data, supported by the unlabelled data. However, once the training session is over, both the labelled and unlabelled data become unavailable. Depending on the scenario, future experiences may contain samples from seen, unseen or distractor classes. Distractor classes represent elements in the stream that can be sampled but are not required to be learned or classified, and therefore never appear in the labelled stream.
Based on these restrictions, three scenarios are proposed to test the robustness of the strategies developed by the participants. Each scenario presents a Labelled Data Stream (LS) and an Unlabelled Data Stream (US). Depending on the scenario, the US can contain samples belonging to:
the same classes as the LS,
all classes appearing anywhere in the LS (including labelled classes not yet seen),
distractor classes present neither in the LS nor in the evaluation set.
Participants are asked to develop strategies that, after the model has finished training on the entire stream of experiences, achieve high average accuracy on an evaluation test set which contains a balanced number of unseen samples from all classes in the LS. The proposed strategy will learn a different model for each scenario, but has to apply the same algorithm to all three scenarios.
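The stream structure described above can be mimicked with a toy Class-Incremental with Repetition generator. The sketch below is purely illustrative (the real scenarios ship as pickle configuration files in the DevKit); the 50/500/1,000 numbers come from the challenge description, while the classes-per-experience count and sampling scheme are assumptions:

```python
import random

def make_cir_stream(n_experiences=50, n_classes=100, classes_per_exp=5, seed=0):
    """Toy CIR stream: each experience sees a small subset of classes,
    and previously seen classes may repeat in later experiences."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n_experiences):
        present = rng.sample(range(n_classes), classes_per_exp)
        # 500 labelled / 1,000 unlabelled samples, balanced over present classes
        labelled = [(c, "labelled") for c in present for _ in range(500 // classes_per_exp)]
        unlabelled = [(c, "unlabelled") for c in present for _ in range(1000 // classes_per_exp)]
        stream.append({"classes": present, "labelled": labelled, "unlabelled": unlabelled})
    return stream

stream = make_cir_stream()
seen, repeats = set(), 0
for exp in stream:
    repeats += len(seen.intersection(exp["classes"]))  # repetition: classes reappearing
    seen.update(exp["classes"])
```

Iterating over such a stream makes the CIR property concrete: `repeats` counts how often an experience re-presents a class seen earlier, which a strategy must handle without forgetting.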
Metrics
We evaluate all submissions with the classic final accuracy on an evaluation test set. The final accuracy metric measures the accuracy of predictions on a test set containing novel instances from all previously seen classes. The test set contains a balanced representation of new instances from all classes seen as labelled during the training sequence. No images from the distractor classes are included. After training on each scenario, competitors have to provide a class prediction for each image in the test set.
The leaderboard will be ranked by the average final accuracy over the three scenarios. However, we will also provide a tie-break metric within the leaderboard: the convergence rate of accuracy over experiences. After each experience, a prediction on the test set is produced and the corresponding accuracy is stored. This accuracy is calculated only with regard to test samples that belong to the classes seen so far. The accuracy after experience $t$ and the convergence rate are:

$$\mathrm{Acc}_t = \frac{1}{|\mathcal{T}_t|}\sum_{(x,y)\in\mathcal{T}_t}\mathbb{1}\!\left[f_t(x)=y\right], \qquad \mathrm{CR} = \sqrt{\sum_{j} w_j \left(\Delta_j-\bar{\Delta}\right)^2}, \quad \Delta_j=\mathrm{Acc}_j-\mathrm{Acc}_{j-1},$$

where $\mathcal{T}_t$ is the set of test samples from classes seen up to experience $t$, $f_t$ is the model after experience $t$, and the weights $w_j$ give more importance to variance within the later experiences.
For more information on the format of the prediction submission, check the details here: https://github.com/ContinualAI/clvision-challenge-2024.
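Given a list of per-experience accuracies for each scenario, the two leaderboard quantities can be computed along these lines. The convergence rate follows the definition above (a weighted standard deviation of the differences between subsequent accuracies); the linear weighting shown is an illustrative assumption, not the official scheme:

```python
import math

def final_accuracy(accs):
    """Final accuracy: accuracy on the test set after the last experience."""
    return accs[-1]

def convergence_rate(accs):
    """Weighted standard deviation of the differences between subsequent
    accuracies. Later experiences receive larger weights (linear weights
    here are an assumption)."""
    diffs = [b - a for a, b in zip(accs, accs[1:])]
    weights = [j + 1 for j in range(len(diffs))]   # emphasise later experiences
    total = sum(weights)
    mean = sum(w * d for w, d in zip(weights, diffs)) / total
    var = sum(w * (d - mean) ** 2 for w, d in zip(weights, diffs)) / total
    return math.sqrt(var)

# The leaderboard ranks by average final accuracy over the three scenarios.
scenario_accs = [[0.2, 0.35, 0.5], [0.25, 0.3, 0.45], [0.15, 0.3, 0.4]]
avg_final = sum(final_accuracy(a) for a in scenario_accs) / 3
avg_cr = sum(convergence_rate(a) for a in scenario_accs) / 3
```

A smoother accuracy curve yields a smaller convergence rate, which only matters as a tie-break when average final accuracies coincide.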
Scenarios
This challenge consists of three scenarios based on an ImageNet-like computer vision dataset with a fixed number of classes. Each scenario consists of 50 experiences, each with 500 labelled images and 1,000 unlabelled images. These images constitute the above-mentioned Labelled Data Stream (LS) and Unlabelled Data Stream (US) at each experience.
Number of experiences: 50
Number of labelled samples in each experience: 500
Number of unlabelled samples in each experience: 1,000
Samples are equally balanced among present classes in each experience. More details on the scenario distributions at the bottom of this page.
Evaluation and Common Rules
Participants are challenged to develop new strategies using the provided DevKit. The challenge will be articulated in two different phases:
Pre-selection: participants are asked to run their strategies and provide predictions on their machines. Those results are submitted to the CodaLab platform and the submission score is calculated and uploaded to the leaderboard.
Final evaluation: the top five strategies with the highest average test accuracy will be evaluated on novel scenarios, similar to the ones provided in the competition but with small random variations. For this, participants will be asked to share the code for their strategies with the organizers to verify that no foul play was involved. These variations are intended to test the robustness of the submitted strategies. The top strategy and the final ranking will be announced during the CLVISION workshop at CVPR 2024.
The top strategies might be asked to submit a report and prepare a (short) presentation to be given during the workshop. Teams that have submitted interesting solutions, even among the non-winning ones, may optionally be asked for a report paper as well.
The DevKit is based on Avalanche. Changing the data loading process and competition-related modules is not permitted.
Participants are allowed to work in teams, but only one member can submit predictions to the CodaLab system. Each team is allowed 3 submissions per day, with a limitation of 50 total submissions throughout the competition. Using multiple accounts on CodaLab to increase the number of submissions is prohibited.
The organizers reserve the absolute right to disqualify entries that are incomplete or illegible, late entries, or entries that violate the rules.
Strategy Restrictions
Submission: for each submission, the predictions for the three scenarios must come from the same strategy. The strategy must be able to solve the three settings without having a scenario-ID, since it will have to work fine on the novel scenarios in the final phase. In general, this can be seen as the strategy being able to solve the more complex scenario 3, while still being able to solve the simpler experience sequences from scenarios 1 and 2. No data from external sources can be used.
Strategy Design: within each experience, users have full access to the data of that experience. No data from other experiences can be accessed. The default number of epochs or training regime for each experience can be modified. The participants are free to adapt and tailor the epoch iterations and dataset loading. As an example, one may iterate for more epochs in the initial experiences and less in the final ones depending on a particular criterion.
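The adaptive epoch budgeting mentioned above could look like the following sketch; the linear-decay schedule and function name are illustrative choices, not part of the DevKit:

```python
def epochs_for_experience(exp_idx, n_experiences=50, max_epochs=10, min_epochs=2):
    """Linearly decay the number of training epochs over the stream:
    spend more compute on early experiences, less on later ones.
    The schedule itself is an illustrative choice, not a DevKit API."""
    frac = exp_idx / max(n_experiences - 1, 1)
    return max(min_epochs, round(max_epochs - frac * (max_epochs - min_epochs)))

schedule = [epochs_for_experience(i) for i in range(50)]
```

Any criterion is allowed here (loss plateaus, class novelty, remaining time budget), as long as the per-experience data access rule and the total training time limit are respected.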
Model Architecture: all participants must use the ResNet-18 provided in the DevKit as the base architecture for their models. However, they are allowed to add additional modules, e.g. gating modules, as long as they do not exceed the maximum GPU memory allowed for the competition. The model cannot be initialized using pretrained weights.
Replay Buffer: Replay buffers may not be used to store dataset images. However, buffers may be used to store any form of data representation or statistics, such as the model's internal representations. Regardless of buffer type, the buffer size (i.e., the total number of stored exemplars) should not exceed 200.
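A buffer that respects these constraints might store per-class feature prototypes (running means) instead of images. In the sketch below, the 200-exemplar and 1024-float limits come from the rules; the class design and incremental-mean update are assumptions:

```python
class PrototypeBuffer:
    """Stores class prototypes (running means of feature vectors) instead
    of raw images. Enforces the competition limits: at most 200 stored
    exemplars, each at most 1024 floating-point values."""
    MAX_EXEMPLARS = 200
    MAX_FLOATS_PER_EXEMPLAR = 1024

    def __init__(self):
        self._store = {}   # class_id -> (prototype vector, sample count)

    def update(self, class_id, feature_vec):
        if len(feature_vec) > self.MAX_FLOATS_PER_EXEMPLAR:
            raise ValueError("exemplar exceeds 1024 floats")
        if class_id not in self._store and len(self._store) >= self.MAX_EXEMPLARS:
            raise ValueError("buffer full: at most 200 exemplars")
        proto, n = self._store.get(class_id, ([0.0] * len(feature_vec), 0))
        # incremental mean: proto <- proto + (x - proto) / (n + 1)
        proto = [p + (x - p) / (n + 1) for p, x in zip(proto, feature_vec)]
        self._store[class_id] = (proto, n + 1)

    def prototype(self, class_id):
        return self._store[class_id][0]

buf = PrototypeBuffer()
buf.update(0, [1.0, 3.0])
buf.update(0, [3.0, 5.0])   # prototype for class 0 becomes [2.0, 4.0]
```

Prototypes like these can later drive, for example, a nearest-class-mean classifier or a regularization term, without ever storing a raw image.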
Hardware Limitations
Number of GPUs: Participants are allowed to use 1 GPU for training only.
Max GPU Memory Usage: 8000 MB
Max Training Time: 600 minutes
Hardware usage is monitored by the DevKit after each experience. These restrictions are set based on training sessions conducted locally for baseline strategies.
As a reference, the submitted strategies will be evaluated during the final phase on a machine with an NVIDIA TITAN RTX (24 GB of GPU memory), 64 GB of RAM and 12 CPU cores.
Schedule (all times are AoE timezone)
20th February 2024: Beginning of the competition, start of the pre-selection phase.
The challenge scenario config files are released together with the DevKit. The CodaLab leaderboard starts accepting submissions!
13th May 2024: End of the pre-selection phase, start of final evaluation phase.
The submissions portal will stop accepting submissions. The highest-ranking participants will be asked to send their solutions and reports to the challenge organizers for the final evaluation.
18th May 2024: End of final evaluation phase.
The organizers will evaluate the strategies and reports from the highest-ranking participants on the novel scenarios and prepare a final ranking to be revealed on the workshop day. Participants with valid strategies will be asked to present them during the workshop.
18th June 2024: Workshop day.
Winners will present their solutions!
Challenge Portal
To participate in the challenge, use the link: https://codalab.lisn.upsaclay.fr/competitions/17780
FAQ
The download link does not work. What do I do?
The dataset is privately hosted and the server might be temporarily offline. We encourage you to try again after a few minutes.
We also provide a secondary mirror link [here].
How many samples can be stored?
We allow storage of 200 samples. The scenarios assume that the buffer size is agnostic regarding the number of classes.
How is a sample defined?
An individual sample stored in the replay buffer cannot exceed the size of 1024 floating-point values (in vector, matrix or any other format). Each sample can encompass information pertaining to one or multiple samples or classes. Raw images, as well as any straightforward derivatives of them, are not allowed to be stored. Prohibited modifications include, among others: slicing, selecting specific segments of the image, pixel-wise alterations, basic color adjustments and aggregates of multiple images.
Why are raw images not allowed?
We frame the scenario of the competition to be exemplar-free, or within the spirit of not storing direct sample information (or direct modifications, see above). Our intention is that participants explore ways of using the replay buffer without directly storing raw images. For example, the participants have the option to accumulate prototypes or class-wise information/statistics from the seen samples.
Can we use pretrained weights?
No, the model cannot be initialized using pretrained weights. Moreover, during training, no data from external sources can be used.
How do we submit as a team?
Each team is allowed a total of 50 submissions, but only one member can submit predictions to the CodaLab system. Therefore, we ask for all submissions to be made from a single account representing the team. If your team has already submitted from different accounts, please send an email to <guglielmo AT tugraz DOT at> with the list of members.
Am I using the correct pickle configuration files?
On April 9th 2024, we updated the pickle configuration files to address minor discrepancies. If you have not pulled/updated your DevKit from the official repository since then, we kindly request that you do so.
How are submissions evaluated?
Submissions are evaluated according to three metrics:
Accuracy -- After each task in scenarios S1, S2 and S3, the model is evaluated on a test set containing novel instances from all classes seen up to that moment, and the accuracy at task t is captured. Accuracy S1/S2/S3 is equal to the accuracy at the last task.
Average Accuracy -- The average of the three accuracies S1/S2/S3.
Convergence rate -- Given a scenario, this metric captures the oscillation of accuracy from task to task; the less it oscillates, the better. It is defined as the standard deviation of the difference between subsequent accuracies. The three convergence rates for S1, S2 and S3 are then averaged to get the final metric.
How are distractor classes constructed from the dataset?
The dataset provided for the competition includes 130 classes. However, only the images of 100 classes (presented in both labelled and unlabelled form) should be learned during the sequence of experiences. In Scenario 3, the remaining 30 classes are mixed into the data as unlabelled distractors, which are not part of the evaluation set (they are not required to be learned). Note that the number of learnable and distractor classes can be different in the finalist scenarios.
Scenario 1: LS and US contain the same classes in each experience.
Scenario 2: US contains the same classes as LS, as well as past or future classes from the whole LS.
Scenario 3: US contains the same classes as LS, as well as past or future classes from the whole LS, and distractor classes not present in LS.
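The class layout described in the FAQ above can be mimicked with a simple split. The 130/100/30 numbers come from the FAQ; the random assignment itself is illustrative, since the real split is fixed by the pickle configuration files:

```python
import random

def split_classes(n_total=130, n_learnable=100, seed=0):
    """Split the dataset's classes into learnable classes (appear in the LS,
    the US, and the evaluation set) and distractor classes (appear only as
    unlabelled data in Scenario 3, never in the LS or evaluation set)."""
    rng = random.Random(seed)
    classes = list(range(n_total))
    rng.shuffle(classes)
    learnable = sorted(classes[:n_learnable])
    distractors = sorted(classes[n_learnable:])
    return learnable, distractors

learnable, distractors = split_classes()
```

Keeping the two sets disjoint matters for evaluation: predictions are only ever scored against the 100 learnable classes, so a strategy gains nothing from classifying distractors.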