The ROAD++ Challenge: Event Detection for Situation Awareness in Autonomous Driving
Task and Challenges
We propose to organise the following two Tasks, making use of both the ROAD and the new ROAD-Waymo datasets:
T1. Spatiotemporal agent detection: the output takes the form of 'agent tubes', i.e., series of bounding boxes associated with an active road agent across consecutive frames (an object tube formulation).
T2. Spatiotemporal road event detection: by road event we mean the triplet (Agent, Action, Location). Each road event is again represented as a tube of frame-level detections. As the autonomous vehicle's decisions rely on all three types of information provided by ROAD++, this task is highly significant for autonomous driving applications.
Each Task thus consists of regressing whole series ('tubes') of temporally-linked bounding boxes associated with the relevant instances, together with their class label(s); a sketch of this tube representation is given below.
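For concreteness, the following is one possible in-memory representation of such a tube. This is a hypothetical illustration only (field names and label strings are assumptions); the official annotation and submission formats are documented in the dataset repository.

```python
from dataclasses import dataclass, field

@dataclass
class Tube:
    """One spatiotemporal tube: a temporally-linked series of boxes.

    Hypothetical sketch; see the dataset repository for the official format.
    """
    labels: list                                  # class label(s), e.g. ["Pedestrian"] for T1,
                                                  # or an (Agent, Action, Location) triplet for T2
    frames: list = field(default_factory=list)    # consecutive frame indices
    boxes: list = field(default_factory=list)     # one [x1, y1, x2, y2] box per frame
    score: float = 1.0                            # detection confidence for the whole tube

    def add(self, frame_idx, box):
        """Append one frame-level detection to the tube."""
        self.frames.append(frame_idx)
        self.boxes.append(box)
```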
* Challenge participants have 798 videos at their disposal for training and validation. The remaining 202 videos are used to test the final performance of their models. This applies to both Tasks.
Baseline
Inspired by the success of recent 3D CNN architectures for video recognition, and of feature pyramid networks (FPN) trained with focal loss, we propose a simple yet effective 3D feature pyramid network with focal loss (3D-RetinaNet) as a baseline method for ROAD++'s detection tasks.
The code is publicly available on GitHub:
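For intuition, below is a minimal sketch of the focal loss component, following the standard sigmoid formulation of Lin et al. (2017). The hyperparameters shown are the common defaults from that paper, not necessarily those used in the released baseline code.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary (sigmoid) focal loss, as used in RetinaNet-style detectors.

    logits:  raw class scores, shape (N, num_classes)
    targets: binary ground-truth labels (float tensor), same shape
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)               # prob. assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)   # class-balancing weight
    # (1 - p_t)^gamma down-weights easy, well-classified examples
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```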
Important dates
Registration
To register for the challenge, please go directly to the eval.ai challenge platform: https://eval.ai/web/challenges/challenge-page/2043/overview
Dataset
The dataset, along with detailed documentation, can be downloaded from https://github.com/salmank255/Road-waymo-dataset.
Evaluation
Performance will be calculated at the video level for the following tasks:
Agent detection
Road event detection
Performance in each task is measured by video mean average precision (video-mAP). Given the challenging nature of the data, the spatio-temporal Intersection over Union (IoU) detection threshold is set to 0.1, 0.2 and 0.5 (signifying a 10%, 20% and 50% overlap between predicted and ground-truth tubes). The evaluation focuses on the video (tube) level only, to stress our focus on understanding scenes that evolve dynamically in an incremental fashion. The final performance on each task is the equally-weighted average of the performances at the three thresholds.
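For illustration, the sketch below computes one common definition of spatio-temporal tube IoU: the mean per-frame box IoU over the temporal union of the two tubes (counting 0 on frames where one tube is absent). This is an assumption about the metric's exact form; the official evaluation code should be treated as authoritative.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def tube_iou(frames_a, boxes_a, frames_b, boxes_b):
    """Spatio-temporal IoU of two tubes: mean per-frame IoU over the
    temporal union of the tubes' frame ranges."""
    a = dict(zip(frames_a, boxes_a))
    b = dict(zip(frames_b, boxes_b))
    shared = set(a) & set(b)                 # frames where both tubes exist
    union = set(a) | set(b)                  # all frames covered by either tube
    if not union:
        return 0.0
    return sum(box_iou(a[f], b[f]) for f in shared) / len(union)
```

A predicted tube would then count as a true positive at, say, the 0.2 threshold if its tube IoU with an unmatched ground-truth tube of the same class exceeds 0.2.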