The ROAD++ Challenge: Event Detection for Situation Awareness and Domain Adaptation in Autonomous Driving

Task and Challenges


We propose to organize three challenges: two detection challenges making use of both the ROAD and the new ROAD-Waymo datasets, and one recognition challenge on the TACO dataset.

Agent and Road Event Detection Challenges

Each detection task thus consists of regressing whole series (‘tubes’) of temporally-linked bounding boxes associated with the relevant instances, together with their class label(s).
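For illustration only, a tube can be thought of as a simple data structure linking one box per frame over a contiguous frame range. The field names in the sketch below are hypothetical; the actual submission format is specified on the challenge platform.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tube:
    """Hypothetical container for one spatio-temporal detection 'tube':
    one bounding box per frame over a frame range, plus class label(s)
    and an overall confidence score."""
    video_id: str                      # identifier of the source video
    label_ids: List[int]               # agent / action / event class indices
    score: float                       # tube-level confidence
    frame_ids: List[int] = field(default_factory=list)      # frames the tube spans
    boxes: List[List[float]] = field(default_factory=list)  # one [x1, y1, x2, y2] per frame

# Example: a single-label instance tracked over three consecutive frames.
tube = Tube(
    video_id="example_video",
    label_ids=[1],
    score=0.87,
    frame_ids=[120, 121, 122],
    boxes=[[10, 20, 50, 90], [12, 21, 52, 91], [14, 22, 54, 92]],
)
```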

* Challenge participants have 798 videos at their disposal for training and validation. The remaining 202 videos are to be used to test the final performance of their model. This applies to both tasks.

Baseline

Inspired by the success of recent 3D CNN architectures for video recognition and of feature pyramid networks (FPN) with focal loss, we propose a simple yet effective 3D feature pyramid network with focal loss, 3D-RetinaNet, as the baseline method for ROAD++'s detection tasks.
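For reference, the focal loss down-weights well-classified examples so that training focuses on hard ones. The following is a minimal PyTorch sketch of the standard binary focal loss, not the exact implementation used by the baseline (see the repository below for the actual code).

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss as in Lin et al. (RetinaNet); a simplified sketch,
    not the 3D-RetinaNet baseline's exact implementation.

    logits:  raw classification scores, shape (N, num_classes)
    targets: binary ground-truth labels, same shape
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```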


The code is publicly available on GitHub:

https://github.com/salmank255/ROAD_Waymo_Baseline


Atomic Activity Recognition Challenge

* Challenge participants have 4030 videos at their disposal for training and validation. The remaining 1148 videos are to be used to test the final performance of their model. 

Baseline

The task can be formulated as a multi-label action recognition task, so we use the conventional action recognition model X3D, supported by the PyTorchVideo library, as a baseline. We also provide the state-of-the-art method Action-slot, a slot-attention model using X3D as the backbone encoder.
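As a rough illustration (not the official baseline configuration), an X3D backbone can be loaded from PyTorchVideo via Torch Hub and adapted to multi-label recognition by replacing the classification head and training with a binary cross-entropy loss. The class count and head-attribute details below are assumptions and may differ across library versions.

```python
import torch
import torch.nn as nn

# Load an X3D backbone from PyTorchVideo via Torch Hub (Kinetics-pretrained weights).
model = torch.hub.load("facebookresearch/pytorchvideo", "x3d_s", pretrained=True)

# Replace the final projection layer for multi-label recognition.
# NOTE: num_classes is a placeholder, not the TACO label count; the exact head
# structure may differ between PyTorchVideo versions.
num_classes = 64
head_proj = model.blocks[-1].proj
model.blocks[-1].proj = nn.Linear(head_proj.in_features, num_classes)
model.blocks[-1].activation = None   # keep raw logits (some versions apply an activation)

# Multi-label recognition: one sigmoid per class, trained with BCE-with-logits.
criterion = nn.BCEWithLogitsLoss()

clip = torch.randn(2, 3, 13, 160, 160)   # (batch, channels, frames, height, width) for x3d_s
logits = model(clip)                     # (2, num_classes)
targets = torch.zeros(2, num_classes)
targets[:, [3, 7]] = 1.0                 # example multi-hot labels
loss = criterion(logits, targets)
```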


The code is publicly available on GitHub:

https://github.com/HCIS-Lab/Action-slot/tree/main

Important dates

Registration


To register for the challenge, go directly to the challenge platform eval.ai.

Dataset Download


ROAD: []

The details of ROAD can be found here.


TACO: [One-drive]

The details of TACO can be found here.


Evaluation


Performance will be calculated at the video level for the following tasks:



For T1 and T2, performance in each task is measured by video mean average precision (video-mAP), with the spatio-temporal Intersection over Union (IoU) detection threshold set to 0.1, 0.2, and 0.5 (signifying a 10%, 20%, and 50% overlap between the predicted and ground-truth tubes), a choice motivated by the challenging nature of the data. The evaluation focuses on the video (tube) level only, to stress our focus on understanding scenes that evolve dynamically and incrementally, and it exploits the available information on how predicted tubes align spatio-temporally with the ground-truth tubes. The final performance on each task is the equally-weighted average of the performances at the three thresholds.
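For intuition, one common definition of spatio-temporal tube IoU averages the per-frame box IoU over the temporal union of the two tubes, with frames covered by only one tube contributing zero overlap. The sketch below follows that definition and is not the official evaluation code.

```python
def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def tube_iou(pred, gt):
    """Spatio-temporal tube IoU: per-frame box IoU averaged over the temporal
    union of the two tubes; frames covered by only one tube count as zero.
    Tubes are dicts {frame_id: [x1, y1, x2, y2]} -- a sketch, not the metric code.
    """
    frames = set(pred) | set(gt)
    overlaps = [box_iou(pred[f], gt[f]) for f in set(pred) & set(gt)]
    return sum(overlaps) / len(frames) if frames else 0.0
```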

For T3, performance is measured by mean average precision (mAP), a common metric for multi-label video recognition tasks. We also report agent-wise mAP for four-wheelers (mAP@c), two-wheelers (mAP@k), pedestrians (mAP@p), grouped four-wheelers (mAP@c+), grouped two-wheelers (mAP@k+), and grouped pedestrians (mAP@p+).
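As an illustration of the metric (not the official evaluation script), mAP can be computed by averaging per-class average precision, optionally restricted to a subset of classes to obtain the agent-wise scores. The helper below assumes multi-hot ground-truth and per-class score matrices.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def multilabel_map(y_true, y_score, class_indices=None):
    """Mean average precision over classes; a sketch of the standard metric.

    y_true:        (num_videos, num_classes) binary ground-truth matrix
    y_score:       (num_videos, num_classes) predicted scores
    class_indices: optional subset of classes (e.g. only pedestrian-related
                   classes) for an agent-wise score such as mAP@p.
    """
    if class_indices is not None:
        y_true, y_score = y_true[:, class_indices], y_score[:, class_indices]
    aps = [
        average_precision_score(y_true[:, c], y_score[:, c])
        for c in range(y_true.shape[1])
        if y_true[:, c].any()            # skip classes with no positive examples
    ]
    return float(np.mean(aps))
```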

Challenge Result