To enable the research community to thoroughly investigate safety-critical driving scene understanding for autonomous driving, this workshop introduces the DADA-2000 dataset [1], a human-gaze-aligned egocentric accident video benchmark collected in driving scenes. DADA-2000 is a larger and more diverse video benchmark annotated with driver fixations for every video frame. Each gaze map is obtained by accumulating the fixation points of five subjects and applying a Gaussian filter with a 50×50 pixel kernel.
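As a rough illustration of this annotation procedure, the sketch below accumulates fixation points into a per-frame map and smooths it with a Gaussian kernel. The kernel-to-sigma conversion and the normalization are our assumptions; the dataset authors' exact pipeline may differ.

```python
# Minimal sketch: build a per-frame gaze map from fixation points given as
# (x, y) pixel coordinates collected from five subjects. This is an
# illustrative assumption, not the official DADA-2000 annotation code.
import numpy as np
from scipy.ndimage import gaussian_filter

def build_gaze_map(fixations, height=660, width=1584, kernel=50):
    """Accumulate fixation points and smooth them with a Gaussian kernel."""
    gaze = np.zeros((height, width), dtype=np.float32)
    for x, y in fixations:
        if 0 <= y < height and 0 <= x < width:
            gaze[int(y), int(x)] += 1.0
    # The paper states a 50x50 pixel kernel; mapping it to sigma ~ kernel / 6
    # is an assumption made here for the sketch.
    gaze = gaussian_filter(gaze, sigma=kernel / 6.0)
    if gaze.max() > 0:
        gaze /= gaze.max()  # normalize to [0, 1]
    return gaze
```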
DADA-2000 provides over 658,476 frames of human driver gaze fixations across 2,000 video clips at a resolution of 1584×660, covering 54 kinds of accident scenarios. The clips are crowd-sourced and captured in various locations (highway, urban, rural, and tunnel), weather conditions (sunny, rainy, and snowy), and lighting conditions (daytime and nighttime). Building on the MM-AU dataset [2], each video is temporally aligned with text descriptions of accident reasons, prevention solutions, and accident categories, as well as object boxes manually annotated for video frames.
We propose to organize a Visual-Cognitive-Accident-Understanding (Cog-AU) Challenge, supported by the DADA-2000 dataset and focusing on the Critical Object Detection (COD) task.
Task. Critical Object Detection: Given an accident video clip, the goal is to output the bounding boxes associated with human driver gaze fixations, which are considered critical objects. We encourage participants to use the DADA-2000 dataset as a benchmark and to incorporate multimodal large language models (MLLMs) for network modeling.
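As a hypothetical illustration of how detections can be associated with gaze, the sketch below ranks candidate boxes by the gaze-map mass they cover. The (x1, y1, x2, y2) box format and the availability of a gaze map are assumptions for this sketch, not part of the official task definition.

```python
# Hypothetical baseline: score each candidate detection by the gaze-map mass
# inside its box and keep the best-attended one as the critical object.
def select_critical_box(boxes, gaze_map):
    """boxes: iterable of (x1, y1, x2, y2); gaze_map: 2-D array (e.g., NumPy)."""
    best_box, best_score = None, -1.0
    for (x1, y1, x2, y2) in boxes:
        region = gaze_map[int(y1):int(y2), int(x1):int(x2)]
        score = float(region.sum())  # accumulated gaze density inside the box
        if score > best_score:
            best_box, best_score = (x1, y1, x2, y2), score
    return best_box
```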
[1] J. Fang, D. Yan, J. Qiao, et al., "DADA: Driver attention prediction in driving accident scenarios," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 4959-4971, 2021.
[2] J. Fang, L. Li, J. Zhou, et al., "Abductive ego-view accident video understanding for safe driving perception," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22030-22040.
Two examples in DADA-2000. As shown in the figure above, the accident scenario contains numerous distracting (unimportant) objects, while the driver's gaze fixations focus on the critical objects.
To register and participate in the challenge, please go directly to the challenge platform on Codabench.
DADA-2000-Train: [Google-drive] [Baidu Netdisk].
The details of DADA-2000 can be found here.
DADA-2000-Test: [Google-drive] [Baidu Netdisk].
Evaluation. For the Critical Object Detection task, performance is calculated at the video level and measured by the Video Mean Average IoU (Video-mIoU). It is computed with Intersection over Union (IoU) detection thresholds of 0.0, 0.1, 0.2, and 0.5, indicating that the overlap between the predicted bounding box and the ground-truth (GT) critical object box observed by the human driver's gaze must exceed 0%, 10%, 20%, and 50%, respectively. Because of the challenging nature of the data, the final performance is the equally weighted average of the performances at the four thresholds. The evaluation is carried out at the video level only, to emphasize understanding of safety-critical driving scenes as they evolve dynamically.
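Under our reading of this protocol, a minimal sketch of the video-level score could look as follows. Treating the per-threshold score as the fraction of frames whose IoU exceeds the threshold is an assumption; the organizers' exact implementation may differ.

```python
# Sketch of the threshold-averaged, video-level score as described above:
# for each threshold, count a frame as correct when its IoU with the
# gaze-observed GT box exceeds the threshold, average over frames, then
# average the four per-threshold scores (equal weights).
def video_score(frame_ious, thresholds=(0.0, 0.1, 0.2, 0.5)):
    """frame_ious: per-frame IoU values between prediction and GT for one video."""
    if not frame_ious:
        return 0.0
    per_threshold = [
        sum(iou > t for iou in frame_ious) / len(frame_ious)
        for t in thresholds
    ]
    return sum(per_threshold) / len(per_threshold)
```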
As shown in the figure above, for each frame in the video, the IoU metric is calculated at the different thresholds by comparing the predicted boxes with the critical object boxes observed by the human driver's gaze in the test set. Participants are encouraged to use various conditional information when training their models; in addition, extra useful conditions (e.g., depth maps, semantic maps) preprocessed by other foundation models may also be used.
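For completeness, a standard per-frame IoU between a predicted box and the GT critical object box can be computed as below; the (x1, y1, x2, y2) pixel-coordinate box format is an assumption.

```python
# Standard Intersection over Union of two axis-aligned boxes,
# each given as (x1, y1, x2, y2) in pixel coordinates.
def box_iou(pred, gt):
    """Return IoU in [0, 1] between a predicted box and a GT box."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0
```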