The ROAD++ Challenge: Event Detection for Situation Awareness in Autonomous Driving

Task and Challenges


We propose to organise the following two Challenges, making use of both the ROAD and the new ROAD-Waymo datasets:

Each Task consists in regressing whole series (‘tubes’) of temporally-linked bounding boxes associated with the relevant instances, together with their class label(s).
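Concretely, a tube can be thought of as an ordered sequence of per-frame bounding boxes sharing a single identity and class label. A minimal sketch of such a structure in Python (the class and field names are our own, purely for illustration):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A minimal, hypothetical representation of a "tube": an ordered run
# of temporally-linked bounding boxes with one class label.
@dataclass
class Tube:
    label: str                        # e.g. an agent or event class
    frames: List[int] = field(default_factory=list)   # frame indices, increasing
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)  # (x1, y1, x2, y2)

    def add(self, frame: int, box: Tuple[float, float, float, float]) -> None:
        """Append one temporally-linked detection to the tube."""
        self.frames.append(frame)
        self.boxes.append(box)
```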

* Challenge participants have 798 videos at their disposal for training and validation. The remaining 202 videos are to be used to test the final performance of their models. This applies to both Tasks.

Baseline

Inspired by the success of recent 3D CNN architectures for video recognition and of feature pyramid networks (FPN) with focal loss, we propose a simple yet effective 3D feature pyramid network with focal loss (3D-RetinaNet) as the baseline method for ROAD++’s detection tasks.
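For reference, below is a minimal sketch of the focal loss component used in RetinaNet-style detectors (binary form; `alpha` and `gamma` are set to values commonly used in the literature, which may differ from those in the released baseline):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    Down-weights easy examples so that training focuses on hard,
    misclassified anchors -- the key idea behind RetinaNet.
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```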


The code is publicly available on GitHub:

https://github.com/salmank255/ROAD_Waymo_Baseline

Important dates

Registration


To register for the challenge, go directly to the challenge platform EvalAI: https://eval.ai/web/challenges/challenge-page/2043/overview

Dataset


You can download the dataset from https://github.com/salmank255/Road-waymo-dataset, where full details of the dataset can also be found.
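As a quick sanity check after downloading, you might inspect the annotation file. A minimal sketch, assuming the annotations ship as a single JSON file with a ROAD-style `db` dictionary; the filename below is hypothetical, so consult the dataset repository for the actual name and schema:

```python
import json

# Hypothetical annotation filename -- check the Road-waymo-dataset
# repository README for the actual file name and schema.
with open("road_waymo_trainval_v1.0.json") as f:
    annots = json.load(f)

# Print the top-level keys and the number of annotated videos,
# assuming a {"db": {video_name: ...}} layout as in ROAD.
print(list(annots.keys()))
print(len(annots.get("db", {})), "annotated videos")
```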


Submission


All submissions will be made through EvalAI. For more information, please see the challenge page: https://eval.ai/web/challenges/challenge-page/2043/overview

Evaluation


Performance will be calculated at the video level for each Task.

Performance in each Task is measured by video mean average precision (video-mAP), with the spatio-temporal Intersection over Union (IoU) detection threshold set to 0.1, 0.2 and 0.5 (signifying a 10%, 20% and 50% overlap between a predicted tube and a ground-truth tube). These relatively lax thresholds were chosen because of the challenging nature of the data. The final performance on each Task is the equally-weighted average of the performances at the three thresholds.

The evaluation focuses on the video (tube) level only, to stress our focus on understanding scenes that evolve dynamically and incrementally, and it exploits the available information on how predicted tubes align spatio-temporally with the ground-truth tubes.
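For illustration, the sketch below shows one way a spatio-temporal tube IoU and the equally-weighted final score could be computed; the helper names (`tube_iou`, `video_map`, `final_score`) are hypothetical, and the official evaluation code on EvalAI is authoritative:

```python
import numpy as np

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def tube_iou(pred, gt):
    """Spatio-temporal tube IoU: mean per-frame box IoU over the
    temporal union of the two tubes (a frame present in only one
    tube contributes zero overlap). Tubes are {frame: box} dicts."""
    frames = set(pred) | set(gt)
    overlaps = [box_iou(pred[f], gt[f]) if f in pred and f in gt else 0.0
                for f in frames]
    return float(np.mean(overlaps))

# Final task score: equally-weighted mean of video-mAP at the three
# thresholds. `video_map` stands in for the official video-mAP routine.
def final_score(video_map, preds, gts):
    return float(np.mean([video_map(preds, gts, iou_thresh=t)
                          for t in (0.1, 0.2, 0.5)]))
```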

Challenge Results