The ROAD++ Challenge: Event Detection for Situation Awareness and Domain Adaptation in Autonomous Driving
Task and Challenges
We propose to organize three challenges: two detection challenges making use of both the ROAD and the new ROAD-Waymo datasets, and one recognition challenge on the TACO dataset.
Agent and Road Event Detection Challenges
T1. Spatiotemporal agent detection: the output takes the form of ‘agent tubes’ collecting the bounding boxes associated with an active road agent across consecutive frames (in an object-tube formulation).
T2. Spatiotemporal road event detection: by road event we mean the triplet (Agent, Action, Location). Each road event is once again represented as a tube of frame-level detections. As the autonomous vehicle’s decisions make use of all three types of information provided by ROAD++, this task is very significant for autonomous driving applications.
Each task thus consists of regressing whole series (‘tubes’) of temporally linked bounding boxes associated with the relevant instances, together with their class label(s).
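As an illustration of the tube formulation, the sketch below greedily links per-frame boxes into a tube by spatial overlap. This is a hypothetical, simplified linking scheme for exposition; the box format, `link_threshold` value, and greedy strategy are assumptions, not the ROAD++ method.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def link_tube(frame_detections, link_threshold=0.5):
    """Greedily extend a tube with the best-overlapping box in each frame.

    frame_detections: list (one entry per frame) of lists of boxes.
    Returns the linked tube as a list of boxes, one per covered frame.
    """
    tube = [frame_detections[0][0]]          # seed from the first frame
    for boxes in frame_detections[1:]:
        best = max(boxes, key=lambda b: iou(tube[-1], b))
        if iou(tube[-1], best) < link_threshold:
            break                            # tube ends when overlap drops
        tube.append(best)
    return tube
```

Real systems link tubes with class-aware scores and allow short gaps, but the core idea of chaining frame-level detections is the same.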
* Challenge participants have 798 videos at their disposal for training and validation. The remaining 202 videos are to be used to test the final performance of their model. This applies to both tasks.
Baseline
Inspired by the success of recent 3D CNN architectures for video recognition and of feature-pyramid networks (FPN) with focal loss, we propose a simple yet effective 3D feature pyramid network with focal loss as a baseline method for ROAD++’s detection tasks (3D-RetinaNet).
The code is publicly available on GitHub:
https://github.com/salmank255/ROAD_Waymo_Baseline
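To make the baseline's loss concrete, here is a minimal NumPy sketch of the binary focal loss that RetinaNet-style detectors use for dense classification. The `alpha`/`gamma` values are the common RetinaNet defaults and may differ from the ROAD++ baseline's configuration.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged elementwise.

    p: predicted foreground probabilities in [0, 1]; y: binary labels.
    The (1 - p_t)^gamma factor down-weights easy, already-confident examples.
    """
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)             # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

The modulating factor is what lets the detector cope with the extreme foreground/background imbalance of dense anchor grids.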
Atomic Activity Recognition Challenge
T3. Multi-label atomic activity recognition: the task is formulated as multi-label action recognition, where the output is a multi-label prediction over 64 atomic-activity classes for a clip. Each activity class is defined as (region_start -> region_end: agent_type), where region_start and region_end denote positions in the road topology of an intersection.
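The class definition above can be made concrete with a toy encoding: enumerate every (region_start -> region_end: agent_type) combination and represent a clip's activities as a multi-hot vector. The region and agent names below are purely illustrative assumptions (the real TACO taxonomy has 64 classes; this toy setup yields 48).

```python
# Assumed, illustrative vocabulary -- not the actual TACO region/agent names.
REGIONS = ["z1", "z2", "z3", "z4"]     # corridors of a toy intersection
AGENTS = ["c", "k", "p", "c+"]         # e.g. car, two-wheeler, pedestrian, group

# Every directed region pair (start != end), crossed with every agent type.
CLASSES = [f"{s}->{e}:{a}"
           for s in REGIONS for e in REGIONS if s != e
           for a in AGENTS]

def encode(active):
    """Multi-hot target vector over the atomic-activity classes in a clip."""
    return [1 if c in active else 0 for c in CLASSES]
```

A clip in which a car crosses from z1 to z2 while a pedestrian crosses from z3 to z1 would simply have two ones in its target vector.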
* Challenge participants have 4030 videos at their disposal for training and validation. The remaining 1148 videos are to be used to test the final performance of their model.
Baseline
The task can be formulated as multi-label action recognition, so we use the conventional action recognition model X3D as a baseline, supported by the PyTorchVideo library. We also provide the state-of-the-art method Action-slot, a slot-attention model using X3D as the backbone encoder.
The code is publicly available on GitHub:
Important dates
Evaluation
Performance will be calculated at the video level for the following tasks:
T1: Agent Detection
T2: Road Event Detection
T3: Atomic Activity Recognition
For T1 and T2, performance is measured by video mean average precision (video-mAP). Because of the challenging nature of the data, the spatio-temporal Intersection over Union (IoU) detection threshold is set to 0.1, 0.2, and 0.5 (signifying a 10%, 20%, and 50% overlap between predicted and ground-truth tubes). The evaluation focuses on the video (tube) level only, to stress our focus on understanding scenes that evolve dynamically and incrementally, and because information on how predicted tubes align spatio-temporally with ground-truth tubes is available. The final performance on each task is determined by the equally weighted average of the performances at the three thresholds.
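A common way to realize the spatio-temporal tube IoU used in video-mAP is to multiply the temporal overlap of two tubes by the mean per-frame box IoU over their shared frames. The sketch below follows that standard definition; the official ROAD++ evaluation code may differ in detail.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if inter else 0.0

def tube_iou(tube_a, tube_b):
    """Spatio-temporal IoU. Tubes map frame index -> (x1, y1, x2, y2)."""
    t_inter = set(tube_a) & set(tube_b)
    t_union = set(tube_a) | set(tube_b)
    if not t_inter:
        return 0.0
    # Mean spatial IoU over the frames both tubes cover...
    spatial = sum(box_iou(tube_a[f], tube_b[f]) for f in t_inter) / len(t_inter)
    # ...scaled by the temporal IoU of the two frame spans.
    return (len(t_inter) / len(t_union)) * spatial
```

A detection counts as a true positive at a given threshold when this score against a ground-truth tube of the same class exceeds 0.1, 0.2, or 0.5 respectively.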
For T3, the performance is measured by mean average precision (mAP), a common metric used in multi-label video recognition tasks. We also report agent-wise mAP, including four-wheeler (mAP@c), two-wheeler (mAP@k), pedestrian (mAP@p), grouped four-wheelers (mAP@c+), grouped two-wheelers (mAP@k+), and grouped pedestrians (mAP@p+).
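The multi-label mAP for T3 can be sketched as per-class average precision, meaned over classes that have at least one positive clip. This follows the standard definition; the official evaluation script may differ in tie-breaking or interpolation details.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: precision summed at each positive, ranked by score."""
    order = np.argsort(-scores)                 # highest score first
    labels = labels[order]
    cum_pos = np.cumsum(labels)
    precision = cum_pos / (np.arange(len(labels)) + 1)
    return float(np.sum(precision * labels) / max(labels.sum(), 1))

def mean_ap(score_matrix, label_matrix):
    """mAP over classes with positives. Rows = clips, columns = classes."""
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(label_matrix.shape[1])
           if label_matrix[:, c].any()]
    return float(np.mean(aps))
```

The agent-wise variants (mAP@c, mAP@k, and so on) simply restrict the column set to the classes of one agent type before averaging.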