The ROAD++ Challenge: Event Detection for Situation Awareness and Domain Adaptation in Autonomous Driving
Task and Challenges
We propose to organize three challenges: two detection challenges making use of both the ROAD and the new ROAD-Waymo datasets, and one recognition challenge on the TACO dataset.
Agent and Road Event Detection Challenges
T1. Spatiotemporal agent detection: the output takes the form of ‘agent tubes’ collecting the bounding boxes associated with an active road agent across consecutive frames (in an object-tube formulation).
T2. Spatiotemporal road event detection: by road event we mean the triplet (Agent, Action, Location). Each road event is once again represented as a tube of frame-level detections. As the autonomous vehicle’s decisions make use of all three types of information provided by ROAD++, this task is very significant for autonomous driving applications.
Each task thus consists of regressing whole series (‘tubes’) of temporally linked bounding boxes associated with the relevant instances, together with their class label(s).
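As an illustration of the tube formulation, the sketch below greedily links per-frame boxes into a tube by spatial overlap. This is a hypothetical, simplified linking scheme for exposition; the box format, `link_threshold` value, and greedy strategy are assumptions, not the ROAD++ method.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def link_tube(frame_detections, link_threshold=0.5):
    """Greedily extend a tube with the best-overlapping box in each frame.

    frame_detections: list (one entry per frame) of lists of boxes.
    Returns the linked tube as a list of boxes, one per covered frame.
    """
    tube = [frame_detections[0][0]]          # seed from the first frame
    for boxes in frame_detections[1:]:
        best = max(boxes, key=lambda b: iou(tube[-1], b))
        if iou(tube[-1], best) < link_threshold:
            break                            # tube ends when overlap drops
        tube.append(best)
    return tube
```

Real systems link tubes with class-aware scores and allow short gaps, but the core idea of chaining frame-level detections is the same.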
* Challenge participants have 798 videos at their disposal for training and validation. The remaining 202 videos are to be used to test the final performance of their model. This applies to both tasks.
Baseline
Inspired by the success of recent 3D CNN architectures for video recognition and of feature-pyramid networks (FPN) with focal loss, we propose a simple yet effective 3D feature pyramid network with focal loss as a baseline method for ROAD++’s detection tasks (3D-RetinaNet).
The code is publicly available on GitHub:
https://github.com/salmank255/ROAD_Waymo_Baseline
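To make the baseline's loss concrete, here is a minimal NumPy sketch of the binary focal loss that RetinaNet-style detectors use for dense classification. The `alpha`/`gamma` values are the common RetinaNet defaults and may differ from the ROAD++ baseline's configuration.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged elementwise.

    p: predicted foreground probabilities in [0, 1]; y: binary labels.
    The (1 - p_t)^gamma factor down-weights easy, already-confident examples.
    """
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)             # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

The modulating factor is what lets the detector cope with the extreme foreground/background imbalance of dense anchor grids.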
Atomic Activity Recognition Challenge
T3. Multi-label atomic activity recognition: the task is formulated as multi-label action recognition, where the output is a multi-label prediction over 64 atomic-activity classes for a clip. Each activity class is defined as (region_start -> region_end: agent_type), where region_start and region_end denote positions in the road topology of an intersection.
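The class definition above can be made concrete with a toy encoding: enumerate every (region_start -> region_end: agent_type) combination and represent a clip's activities as a multi-hot vector. The region and agent names below are purely illustrative assumptions (the real TACO taxonomy has 64 classes; this toy setup yields 48).

```python
# Assumed, illustrative vocabulary -- not the actual TACO region/agent names.
REGIONS = ["z1", "z2", "z3", "z4"]     # corridors of a toy intersection
AGENTS = ["c", "k", "p", "c+"]         # e.g. car, two-wheeler, pedestrian, group

# Every directed region pair (start != end), crossed with every agent type.
CLASSES = [f"{s}->{e}:{a}"
           for s in REGIONS for e in REGIONS if s != e
           for a in AGENTS]

def encode(active):
    """Multi-hot target vector over the atomic-activity classes in a clip."""
    return [1 if c in active else 0 for c in CLASSES]
```

A clip in which a car crosses from z1 to z2 while a pedestrian crosses from z3 to z1 would simply have two ones in its target vector.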
* Challenge participants have 4030 videos at their disposal for training and validation. The remaining 1148 videos are to be used to test the final performance of their model.
Baseline
The task can be formulated as multi-label action recognition, so we use the conventional action recognition model X3D as a baseline, supported by the PyTorchVideo library. We also provide the state-of-the-art method Action-slot, a slot-attention model using X3D as the backbone encoder.
The code is publicly available on GitHub:
Important dates
Evaluation
Performance will be calculated at the video level for the following tasks:
T1: Agent Detection
T2: Road Event Detection
T3: Atomic Activity Recognition
For T1 and T2, performance is measured by video mean average precision (video-mAP). Because of the challenging nature of the data, the spatio-temporal Intersection over Union (IoU) detection threshold is set to 0.1, 0.2, and 0.5 (signifying a 10%, 20%, and 50% overlap between predicted and ground-truth tubes). The evaluation focuses on the video (tube) level only, to stress our focus on understanding scenes that evolve dynamically and incrementally, and because information on how predicted tubes align spatio-temporally with ground-truth tubes is available. The final performance on each task is determined by the equally weighted average of the performances at the three thresholds.
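A common way to realize the spatio-temporal tube IoU used in video-mAP is to multiply the temporal overlap of two tubes by the mean per-frame box IoU over their shared frames. The sketch below follows that standard definition; the official ROAD++ evaluation code may differ in detail.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if inter else 0.0

def tube_iou(tube_a, tube_b):
    """Spatio-temporal IoU. Tubes map frame index -> (x1, y1, x2, y2)."""
    t_inter = set(tube_a) & set(tube_b)
    t_union = set(tube_a) | set(tube_b)
    if not t_inter:
        return 0.0
    # Mean spatial IoU over the frames both tubes cover...
    spatial = sum(box_iou(tube_a[f], tube_b[f]) for f in t_inter) / len(t_inter)
    # ...scaled by the temporal IoU of the two frame spans.
    return (len(t_inter) / len(t_union)) * spatial
```

A detection counts as a true positive at a given threshold when this score against a ground-truth tube of the same class exceeds 0.1, 0.2, or 0.5 respectively.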
For T3, the performance is measured by mean average precision (mAP), a common metric used in multi-label video recognition tasks. We also report agent-wise mAP, including four-wheeler (mAP@c), two-wheeler (mAP@k), pedestrian (mAP@p), grouped four-wheelers (mAP@c+), grouped two-wheelers (mAP@k+), and grouped pedestrians (mAP@p+).
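The multi-label mAP for T3 can be sketched as per-class average precision, meaned over classes that have at least one positive clip. This follows the standard definition; the official evaluation script may differ in tie-breaking or interpolation details.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: precision summed at each positive, ranked by score."""
    order = np.argsort(-scores)                 # highest score first
    labels = labels[order]
    cum_pos = np.cumsum(labels)
    precision = cum_pos / (np.arange(len(labels)) + 1)
    return float(np.sum(precision * labels) / max(labels.sum(), 1))

def mean_ap(score_matrix, label_matrix):
    """mAP over classes with positives. Rows = clips, columns = classes."""
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(label_matrix.shape[1])
           if label_matrix[:, c].any()]
    return float(np.mean(aps))
```

The agent-wise variants (mAP@c, mAP@k, and so on) simply restrict the column set to the classes of one agent type before averaging.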