Challenges

The Workshop proposes the following four CSSL Challenges, making use of both the new MEVA-CL dataset (CAR-A and CAR-I) and the new CCCL continual learning benchmark for crowd counting (CCC-A and CCC-I):

CAR-A Continual Semi-supervised Activity Recognition – Absolute.

The goal is to achieve the best average performance across all the unlabelled portions (i.e., the test folds) of the MEVA-CL dataset in a CSSL setting, leaving the choice of the base detector model to the participants.


CAR-I Continual Semi-supervised Activity Recognition – Incremental.

The goal here is to achieve the best performance improvement over time, measured by comparing the average performance of the incrementally updated model over the validation and test folds with the average performance of the initial supervised model on the test fold.
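One plausible reading of this criterion is sketched below; the authoritative metric is the one implemented in the organizers' EvalAI scoring script, so the per-fold scores here are purely illustrative.

```python
# Minimal sketch of the incremental ("-I") ranking criterion as read from the text above.
# The exact metric is defined by the organizers' EvalAI scoring script; this only
# illustrates the idea with hypothetical per-fold scores.

def incremental_improvement(updated_scores_val_test, initial_scores_test):
    """Average score of the incrementally updated model over the validation and
    test folds, compared with the average score of the initial supervised model
    on the test fold."""
    avg_updated = sum(updated_scores_val_test) / len(updated_scores_val_test)
    avg_initial = sum(initial_scores_test) / len(initial_scores_test)
    return avg_updated - avg_initial
```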

A baseline detector model is provided by the organizers at https://github.com/salmank255/IJCAI-2021-Continual-Activity-Recognition-Challenge.

The baseline model is the recent EfficientNet network (model EfficientNet-B5), pre-trained on the large-scale ImageNet dataset. Detailed information about its implementation, along with pre-trained models, can be found on GitHub (https://github.com/lukemelas/EfficientNet-PyTorch), and the package is easily installable via pip (pip install efficientnet-pytorch).
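As a quick sanity check, a minimal sketch of loading this baseline backbone with the efficientnet-pytorch package might look as follows; the number of activity classes below is a placeholder, not the challenge value.

```python
# Minimal sketch: load the ImageNet pre-trained EfficientNet-B5 backbone.
# Assumes the efficientnet-pytorch package (pip install efficientnet-pytorch).
import torch
from efficientnet_pytorch import EfficientNet

NUM_ACTIVITY_CLASSES = 8  # placeholder: replace with the number of MEVA-CL activity classes

# Load EfficientNet-B5 pre-trained on ImageNet and replace the classification head.
model = EfficientNet.from_pretrained('efficientnet-b5', num_classes=NUM_ACTIVITY_CLASSES)

# EfficientNet-B5 expects 456x456 inputs by default.
dummy = torch.randn(1, 3, 456, 456)
logits = model(dummy)
print(logits.shape)  # torch.Size([1, 8])
```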

Note that the performance of the baseline activity model is rather poor on our challenge videos, as relevant activities occupy only a small fraction of the duration of the videos. This leaves much room for improvement while properly representing the level of challenge that real-world data poses.


CCC-A Continual Semi-supervised Crowd Counting – Absolute.

As in CAR-A, the goal is to achieve the best average performance across the unlabelled data streams (i.e., the test folds) of the CCCL dataset, leaving the choice of the base crowd counting model to the participants.


CCC-I Continual Semi-supervised Crowd Counting – Incremental.

The goal here is to achieve the best performance improvement over time, measured by comparing the average performance of the incrementally updated model over the validation and test folds with the average performance of the initial supervised model on the test fold.

A baseline crowd counting model is provided by the organizers at https://github.com/Ajmal70/IJCAI_2021_Continual_Crowd_Counting_Challenge. As the baseline crowd counting model we selected the Multi-Column Convolutional Neural Network (MCNN). Its implementation, along with pre-trained models, can also be found on GitHub (https://github.com/svishwa/crowdcount-mcnn). The network is implemented in PyTorch, and pre-trained models are available for both the ShanghaiTech A and ShanghaiTech B datasets. For this Challenge we chose to adopt the ShanghaiTech B pre-trained model.
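For reference, a simplified PyTorch sketch of the MCNN idea (three columns with different receptive fields fused into a single density map) is shown below; it is an illustration only, and for the challenge the pre-trained ShanghaiTech B model from the repository linked above should be used.

```python
# Simplified sketch of the MCNN architecture: three convolutional columns with
# different kernel sizes, concatenated and fused by a 1x1 convolution into a
# density map whose sum is the predicted count. Illustration only.
import torch
import torch.nn as nn

def column(channels, kernels):
    """One MCNN column: conv-pool-conv-pool-conv-conv with a fixed kernel pattern."""
    c1, c2, c3, c4 = channels
    k1, k2 = kernels
    return nn.Sequential(
        nn.Conv2d(1, c1, k1, padding=k1 // 2), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Conv2d(c1, c2, k2, padding=k2 // 2), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Conv2d(c2, c3, k2, padding=k2 // 2), nn.ReLU(inplace=True),
        nn.Conv2d(c3, c4, k2, padding=k2 // 2), nn.ReLU(inplace=True),
    )

class MCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.large = column((16, 32, 16, 8), (9, 7))    # large receptive field
        self.medium = column((20, 40, 20, 10), (7, 5))  # medium receptive field
        self.small = column((24, 48, 24, 12), (5, 3))   # small receptive field
        self.fuse = nn.Conv2d(8 + 10 + 12, 1, kernel_size=1)  # 1x1 fusion to density map

    def forward(self, x):  # x: (B, 1, H, W) grayscale frame
        features = torch.cat([self.large(x), self.medium(x), self.small(x)], dim=1)
        return self.fuse(features)  # density map at 1/4 resolution

count = MCNN()(torch.randn(1, 1, 240, 320)).sum()  # predicted count = sum of the density map
```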


Protocol for Incremental Training and Testing

Following from the problem definition, once a model is fine-tuned on the supervised portion of a data stream it is subsequently both incrementally updated using the unlabelled portion of the same data stream and tested there, using the provided ground truth.

Importantly, incremental training and testing must happen independently for each sequence, as we intend to simulate real-world scenarios in which a smart device with continual learning capability can only learn from its own data stream after deployment.
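A schematic of this per-sequence protocol is sketched below; the callables (fine_tune, update_and_predict, score) are hypothetical placeholders for the participants' own code, not part of the challenge toolkit.

```python
# Schematic of the incremental training/testing protocol: each data stream is
# processed in complete isolation, as if by a separate deployed device.

def process_sequence(base_model, supervised_fold, unlabelled_stream, ground_truth,
                     fine_tune, update_and_predict, score):
    """fine_tune, update_and_predict and score are supplied by the participant."""
    model = fine_tune(base_model, supervised_fold)               # supervised fold S only
    predictions = update_and_predict(model, unlabelled_stream)   # incremental update on V + T
    return score(predictions, ground_truth)                      # evaluated with the provided GT
```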

Split into Training, Validation and Test

The data for the challenges are released in two stages:

1. We will first release the supervised portion of each data stream, together with a portion of the unlabelled data stream to use for the validation of the semi-supervised continual learning approach proposed by the participants.

2. The remaining portion of the unlabelled data stream for each sequence in the dataset is released at a later stage to be used for the testing of the proposed approach.

Consequently, each data stream (sequence) in our benchmarks is divided into a supervised train fold (S), a validation fold (V) and a test fold (T).

For the CAR challenge, the supervised fold for each sequence coincides with the first 5-minute video, the validation fold with the second 5-minute video, and the test fold with the third 5-minute video.

For the CCC challenge we distinguish two cases. For the 2,000-frame sequences from either the UCSD or the Mall dataset, S is formed by the first 400 images, V by the following 800 images, and T by the remaining 800 images. For the 750-frame sequence from the FDST dataset, S is the set of the first 150 images, V the set of the following 300 images, and T the set of the remaining 300 images.
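The CCC frame splits above can be encoded as in the small sketch below (frame indices are taken to be 0-based here; adjust if your file naming is 1-based). The CAR splits are per 5-minute video rather than per frame, so they are not covered by this helper.

```python
# Frame-index ranges for the CCC splits described above (0-based indexing assumed).
CCC_SPLITS = {
    # dataset: (supervised S, validation V, test T)
    "ucsd": (range(0, 400), range(400, 1200), range(1200, 2000)),
    "mall": (range(0, 400), range(400, 1200), range(1200, 2000)),
    "fdst": (range(0, 150), range(150, 450),  range(450, 750)),
}

s, v, t = CCC_SPLITS["fdst"]
print(len(s), len(v), len(t))  # 150 300 300
```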


Baseline incremental learning approach

Our baseline for incremental learning from the unlabelled data stream is based on a vanilla self-training approach.

For each sequence, the unlabelled data stream (without distinction between validation and test folds) is partitioned into a number of sub-folds. Each sub-fold spans 1 minute in the CAR challenges, so that each unlabelled sequence is split into 10 sub-folds. Sub-folds span 100 frames in the CCC challenges, so that the UCSD and Mall sequences comprise 16 sub-folds (eight sub-folds each for validation and testing) whereas the FDST sequence contains only six sub-folds (three sub-folds each for validation and testing).

Starting with the model initially fine-tuned on the supervised portion of the data stream, self-training is iteratively applied in a batch fashion to each sub-fold. The predictions generated by the model obtained after convergence upon a sub-fold are the baseline predictions for the current sub-fold. The output of each self-training session is used as the starting model for the following session.
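In the crowd-counting case, a minimal sketch of this baseline might look as follows; the helper names, loss, and hyper-parameters are illustrative assumptions and not the organizers' exact implementation.

```python
# Vanilla self-training over the sub-folds of one unlabelled data stream
# (crowd-counting case: the model maps a batch of frames to density maps).
import torch

def self_train_stream(model, subfolds, fine_tune_steps=100, lr=1e-5):
    """subfolds: list of tensors, one batch of unlabelled frames per sub-fold
    (10 x 1-minute sub-folds for CAR; 100-frame sub-folds for CCC)."""
    baseline_predictions = []
    for frames in subfolds:
        # 1) Generate pseudo-labels for the current sub-fold with the current model.
        model.eval()
        with torch.no_grad():
            pseudo_labels = model(frames)
        # 2) Fine-tune the model on (frames, pseudo_labels) in a batch fashion.
        model.train()
        optimiser = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(fine_tune_steps):
            optimiser.zero_grad()
            loss = torch.nn.functional.mse_loss(model(frames), pseudo_labels)
            loss.backward()
            optimiser.step()
        # 3) The converged model's outputs are the baseline predictions for this sub-fold;
        #    the model is carried over as the starting point for the next sub-fold.
        model.eval()
        with torch.no_grad():
            baseline_predictions.append(model(frames))
    return model, baseline_predictions
```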


Challenge Rules

Continual semi-supervised learning is a new problem for which, unlike in traditional learning tasks, there is no separation between the data used for training and that used for testing. Once a model is fine-tuned on the labelled portion T0 of a data stream, it is both incrementally updated using the unlabelled portion of the data stream and tested there, using the provided ground truth. Importantly, incremental training and testing must happen independently for each sequence.

The organisers reserve the right to reproduce the participants' results and check their validity. In accordance with the principles of semi-supervised continual learning, fine-tuning a model on the labelled portions of multiple data streams is not allowed; i.e., performance will be measured on models customized for each data stream independently.

Evaluation Protocols

Participants will be able to evaluate the performance of their method(s) on both the incremental and the absolute versions of the challenges on eval.ai.

Crowd-Counting challenge: https://eval.ai/web/challenges/challenge-page/986/overview

Activity Recognition challenge: https://eval.ai/web/challenges/challenge-page/984/overview

In Stage 1, participants will, for each task (CAR-A, CAR-I, CCC-A, CCC-I), submit their predictions as generated on the validation folds and receive the evaluation metric in return, in order to get a feel for how well their method(s) work. In Stage 2, they will submit the predictions generated on the test folds, which will be used for the final ranking.

For CCC-A and CCC-I, participants are required to submit their predictions as a single CSV file, with the predictions arranged as fdst (rows 1-300), ucsd (rows 301-1100) and mall (rows 1101-1900). Code to convert the 1,900 per-frame prediction CSV files into a single CSV file containing only the predicted counts is available in the baseline GitHub repository. The organizers reserve the right to also request the 1,900 individual CSV files if needed.
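An illustrative sketch of assembling such a submission file in the stated order is given below; the file name and single-column layout are assumptions, and the conversion script in the baseline GitHub repository remains the authoritative reference.

```python
# Assemble the single CCC submission CSV: fdst (rows 1-300), ucsd (rows 301-1100),
# mall (rows 1101-1900), one predicted count per row. Layout is an assumption;
# follow the baseline repository's conversion script for the official format.
import csv

def write_submission(fdst_counts, ucsd_counts, mall_counts, path="submission.csv"):
    assert len(fdst_counts) == 300 and len(ucsd_counts) == 800 and len(mall_counts) == 800
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for count in list(fdst_counts) + list(ucsd_counts) + list(mall_counts):
            writer.writerow([count])  # 1,900 rows in total
```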

A separate ranking will be produced for each of the tasks. For each of the challenge stages and each task, the maximum number of submissions through the EvalAI platform is capped at 50, with an additional constraint of 5 submissions per day.

Please send your CCC-related questions to:

  • Ajmal Shahbaz at ashahbaz@brookes.ac.uk

  • Mohamad Asiful Hossain at mohammad.asiful.hossain@huawei.com

  • Kevin Cannons at kevin.cannons@huawei.com

Please send your CAR-related questions to:

  • Salman Khan at 19052999@brookes.ac.uk

  • Vincenzo Lomonaco at vincenzo.lomonaco@unipi.it