The ROAD dataset


About the Dataset


The ROAD dataset is built upon the Oxford RobotCar Dataset (OxRD). Please cite the original dataset if it is useful in your work; its citation can be found here. ROAD is released together with a paper and the 3D-RetinaNet code as a baseline, which also contains the evaluation code. The ROAD dataset will be used in our challenge.

Similar to the original work (OxRD), this work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License and is intended for non-commercial academic use. If you are interested in using the dataset for commercial purposes, please contact the original creators (OxRD) for the video content and Fabio and Gurkirt for the event annotations.

ROAD is the result of annotating 22 carefully selected, relatively long-duration (circa 8 minutes each) videos from the RobotCar dataset in terms of what we call road events (REs), as seen from the point of view of the autonomous vehicle capturing the video. REs are defined as triplets E = (Ag, Ac, Loc) composed of a moving agent Ag, the action Ac it performs, and the location Loc in which this takes place. Agent, action and location are all classes in a finite list compiled by surveying the content of the 22 videos. Road events are represented as 'tubes', i.e., time series of frame-wise bounding box detections.
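As a concrete illustration, a single road event tube can be thought of as a set of agent, action and location labels attached to a sequence of per-frame bounding boxes. The Python sketch below is a minimal, hypothetical representation of that structure; the class name, field names and label strings are illustrative and are not the dataset's actual annotation schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative sketch only: names and fields are hypothetical,
# not the actual ROAD annotation schema.
@dataclass
class RoadEventTube:
    agent_labels: List[str]       # Ag, e.g. ["Cyclist"]
    action_labels: List[str]      # Ac, e.g. ["Moving-away", "Turning-left"]
    location_labels: List[str]    # Loc, e.g. ["In-vehicle-lane"]
    # Time series of frame-wise detections: (frame_id, x1, y1, x2, y2)
    boxes: List[Tuple[int, float, float, float, float]] = field(default_factory=list)

# Example: a cyclist moving away in the vehicle lane, tracked over three frames
tube = RoadEventTube(
    agent_labels=["Cyclist"],
    action_labels=["Moving-away"],
    location_labels=["In-vehicle-lane"],
    boxes=[(120, 0.31, 0.40, 0.45, 0.78),
           (121, 0.32, 0.41, 0.46, 0.79),
           (122, 0.33, 0.41, 0.47, 0.80)],
)
```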
The dataset was designed according to the following principles:
  • A multi-label benchmark: each road event is composed of the label of the (moving) agent responsible, the label(s) of the type of action(s) being performed, and labels describing where the action is located. Each event can be assigned multiple instances of the same label type whenever relevant (e.g., an RE can be an instance of both moving away and turning left).
  • The labelling is done from the point of view of the AV: the final goal is for the autonomous vehicle to use this information to make the appropriate decisions. The meta-data is intended to contain all the information required to fully describe a road scenario. If one were to close one's eyes, the set of labels associated with the current video frame should be sufficient to recreate the road situation in one's head (or, equivalently, sufficient for the AV to be able to make a decision).
  • ROAD allows one to validate manifold tasks associated with situation awareness for self-driving, each associated with a label type (agent, action, location) or combination thereof: spatiotemporal (i) agent detection, (ii) action detection, (iii) location detection, (iv) agent-action detection, (v) road event detection, as well as the (vi) temporal segmentation of AV actions. For each task one can assess both frame-level detection, which outputs independently for each video frame the bounding box(es) (BBs) of the instances present there and the relevant class labels, and video-level detection, which consists of regressing the whole series of temporally-linked bounding boxes (i.e., in current terminology, a 'tube') associated with an instance, together with the relevant class label.
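To make the task decomposition above concrete, the hedged sketch below shows how per-task label sets (agent, action, location, agent-action and full event) could be derived from a single (Ag, Ac, Loc) triplet. The function and the composite label format are illustrative assumptions, not taken from the released annotation files or evaluation code.

```python
from itertools import product
from typing import Dict, List, Set

def task_labels(agents: List[str], actions: List[str], locations: List[str]) -> Dict[str, Set[str]]:
    """Illustrative only: derive label sets for the different detection tasks
    from one road event triplet E = (Ag, Ac, Loc)."""
    return {
        "agent_detection": set(agents),
        "action_detection": set(actions),
        "location_detection": set(locations),
        "agent_action_detection": {f"{ag}/{ac}" for ag, ac in product(agents, actions)},
        "event_detection": {f"{ag}/{ac}/{loc}"
                            for ag, ac, loc in product(agents, actions, locations)},
    }

# A pedestrian that is both moving away and turning left on the right pavement
print(task_labels(["Pedestrian"], ["Moving-away", "Turning-left"], ["Right-pavement"]))
```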

Demos

Main features


  • Action annotations for humans as well as other road agents, e.g. Turning-right, Moving-away, etc.

  • Agent type labels, e.g. Pedestrian, Car, Cyclist, Large-Vehicle, Emergency-Vehicle etc.

  • Semantic labels for the location of each agent, e.g. in the vehicle lane, on the right pavement, etc.

  • 122K annotated frames from 22 videos, each video approximately 8 minutes long on average.

  • Track/tube IDs annotated for every bounding box on every frame, for every agent in the scene.

  • 7K tubes/tracks of individual agents.

    • Each tube consists on average of approximately 80 bounding boxes linked over time.

  • 559K bounding box-level agent labels.

  • 641K and 498K bounding box-level action and location labels, respectively.

  • All 122K frames are also annotated with the self/ego-actions of the AV, e.g. AV-on-the-move, AV-Stopped, AV-turning-right, AV-Overtaking, etc.

Download


Please follow the instructions here (https://github.com/gurkirt/road-dataset#download).

Alternatively, you can download the train/val-set videos and annotations from the Google Drive link.

The private test-set videos will be released in accordance with the schedule of our challenge.

Annotation structure


The dataset contains, as shown in the following figure, a 'videos' folder and a corresponding '.zip' file including all 18 videos, together with 'road_trainval_v1.0.json', which comprises the annotations. More details can be found here (https://github.com/gurkirt/road-dataset#frame-extraction).
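The exact layout of 'road_trainval_v1.0.json' is documented in the road-dataset repository linked above. As a minimal sketch, one could open the file and list its top-level keys before writing a proper loader; the path below assumes the dataset was extracted into a 'road' folder and should be adjusted accordingly.

```python
import json

# Minimal sketch: inspect the annotation file without assuming its schema.
# Adjust the path to wherever the dataset was extracted.
with open("road/road_trainval_v1.0.json", "r") as f:
    annots = json.load(f)

# Print the top-level keys (e.g. label lists, per-video annotations)
# to discover the structure before writing a loader.
print(type(annots))
if isinstance(annots, dict):
    for key in annots:
        print(key)
```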