The ROAD Challenge: Event Detection for Situation Awareness in Autonomous Driving


Task and Challenges


ROAD allows one to validate detection tasks associated with any meaningful combination of the three base labels. For this Challenge we consider three video-level detection Tasks:

  • T1. Agent detection, in which the output is in the form of agent tubes collecting the bounding boxes associated with an active road agent in consecutive frames.

  • T2. Action detection, where the output is in the form of action tubes formed by bounding boxes around an action of interest in each video frame.

  • T3. Road event detection, where by road event we mean a triplet (Agent, Action, Location) as explained above, once again represented as a tube of frame-level detections.

Each Task thus consists of regressing whole series (‘tubes’) of temporally linked bounding boxes associated with the relevant instances, together with their class label(s).
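To make the notion of a tube concrete, here is a minimal Python sketch of such a structure. The class name, field names and label strings are purely illustrative and do not represent the official ROAD annotation or submission format.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Tube:
    """A series of temporally linked frame-level detections for one instance."""
    labels: List[str]   # class label(s); a road event carries (Agent, Action, Location)
    score: float        # overall confidence of the tube
    # one (frame_index, (x1, y1, x2, y2)) entry per consecutive frame
    boxes: List[Tuple[int, Tuple[float, float, float, float]]] = field(default_factory=list)

# A toy road-event tube spanning three consecutive frames (labels are illustrative only):
tube = Tube(
    labels=["Pedestrian", "Moving towards", "On pavement"],
    score=0.87,
    boxes=[(10, (120.0, 200.0, 180.0, 400.0)),
           (11, (122.0, 201.0, 182.0, 401.0)),
           (12, (124.0, 202.0, 184.0, 402.0))],
)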

* Challenge participants have 18 videos at their disposal for training and validation. The remaining 4 videos are to be used to test the final performance of their models. This split applies to all three Tasks.

Baseline

As a baseline for all three detection tasks we propose a simple yet effective 3D feature pyramid network with focal loss, an architecture we call 3D-RetinaNet:

http://arxiv.org/abs/2102.11585

The code is publicly available on GitHub:

https://github.com/gurkirt/3D-RetinaNet
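For reference, the focal loss is the classification loss introduced with RetinaNet, which down-weights well-classified examples so that training concentrates on hard ones. The PyTorch sketch below uses the commonly quoted defaults (alpha = 0.25, gamma = 2); the exact settings in the released 3D-RetinaNet code may differ.

import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits: torch.Tensor,
                       targets: torch.Tensor,
                       alpha: float = 0.25,
                       gamma: float = 2.0) -> torch.Tensor:
    """Focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    logits and targets have the same shape; targets are in {0, 1}.
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()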

Important dates

Registration


To register for the challenge, go directly to the challenge platform eval.ai: https://eval.ai/web/challenges/challenge-page/1059/overview

Dataset


You can download the dataset from here (https://github.com/gurkirt/road-dataset).

Details of the dataset can be found in the Dataset tab.

The 4 test-set videos for the challenge can now be found at this Google Drive link.


Submission


Participants will be able to submit their results via the eval.ai platform: https://eval.ai/web/challenges/challenge-page/1059/submission

Evaluation


Performance in each Task is measured by video mean average precision (video-mAP). Because of the challenging nature of the data, the Intersection over Union (IoU) detection threshold is set to 0.1, 0.2 and 0.5 (i.e. a 10%, 20% and 50% overlap between the predicted and ground-truth bounding boxes within each tube). The final performance on each Task is the equally-weighted average of the performances at the three thresholds.
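As a rough illustration of the metric, the sketch below shows one common convention for spatio-temporal tube overlap (per-frame box IoU averaged over the union of the two tubes' frames) and how the three per-threshold scores combine into the final score. It is not the official evaluation code, which should be taken as authoritative.

from typing import Dict, Tuple

Box = Tuple[float, float, float, float]

def box_iou(a: Box, b: Box) -> float:
    """Standard IoU between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def tube_iou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """Spatio-temporal IoU: per-frame box IoU averaged over the union of frames
    (frames covered by only one of the two tubes contribute 0)."""
    frames = set(pred) | set(gt)
    ious = [box_iou(pred[f], gt[f]) if f in pred and f in gt else 0.0
            for f in frames]
    return sum(ious) / len(ious) if ious else 0.0

def final_score(video_map: Dict[float, float]) -> float:
    """Equally-weighted average of video-mAP at the three IoU thresholds."""
    return sum(video_map[t] for t in (0.1, 0.2, 0.5)) / 3.0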

In the first stage of the Challenge participants will, for each Task, submit the predictions generated on the validation fold and receive the evaluation metric in return, so that they can gauge how well their method(s) work. In the second stage they will submit the predictions generated on the test fold, which will be used for the final ranking.

A separate ranking will be produced for each of the Tasks.

Evaluation will take place on the EvalAI platform. For each Challenge stage and each Task the maximum number of submissions is capped at 50, with an additional constraint of 5 submissions per day.

Detailed instructions about how to download the data and submit your predictions for evaluation at both validation and test time, for all three Tasks, are provided on the Challenge website.

Challenge Results



Details of the results can be found here (Link).