The ROAD-R dataset
Contact: roadr-info@googlegroups.com
About the Dataset
ROAD-R is an extension of the ROAD dataset with a set of 243 manually annotated requirements over its 41 labels, which are grouped into agents, actions, and locations. The requirements are logical constraints, provided in disjunctive normal form, that express background knowledge applicable in autonomous driving scenarios, such as:
- A traffic light cannot be red and green at the same time.
- A vehicle lane cannot be a parking lot.
- A traffic light cannot move.
- If an agent is crossing, it is either a pedestrian or a cyclist.
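For instance, the first requirement above is a disjunction of two negated literals: NOT red OR NOT green. Below is a minimal sketch of checking such a constraint against a set of predicted labels; the label names, the clause encoding, and the helper function are hypothetical illustrations, not part of the dataset tooling.

```python
# A minimal sketch of checking a disjunctive requirement against a set of
# predicted labels. The label names and the clause are hypothetical; the
# actual labels and requirements ship with the dataset.

def satisfies(clause, predicted):
    """A clause holds if at least one of its (label, polarity) literals holds."""
    return any((label in predicted) == polarity for label, polarity in clause)

# "A traffic light cannot be red and green at the same time", i.e.
# NOT TrafLightRed OR NOT TrafLightGreen:
clause = [("TrafLightRed", False), ("TrafLightGreen", False)]

print(satisfies(clause, {"TrafLightRed"}))                      # True
print(satisfies(clause, {"TrafLightRed", "TrafLightGreen"}))    # False
```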
The dataset contains 22 videos, each about 8 minutes long, and is split into training (15 videos), validation (3 videos) and test (4 videos) partitions. Each video is annotated in terms of what we call road events (REs), as seen from the point of view of the autonomous vehicle capturing the video (Ego-vehicle). Road events are defined as a series of bounding boxes linked in time annotated with:
1. the label associated with the agent (e.g., “Pedestrian”),
2. the action(s) the agent is doing (e.g., “Pushing Object”, “Moving Away”), and
3. the location(s) where the agent is placed (e.g., “Right Pavement”, “Bus Stop”).
ROAD-R is part of a series of developments upon the ROAD dataset:
The ROAD dataset was built upon the Oxford RobotCar Dataset. It is released with a paper and with 3D-RetinaNet code as a baseline, which also contains the evaluation code.
The ROAD-R dataset extends ROAD with 243 requirements and is the first publicly available dataset for autonomous driving in real-world scenarios with requirements expressed as logical constraints.
Recently, ROAD++ was introduced to extend ROAD with 1000 carefully selected and annotated, relatively long (20 seconds each) videos from the Waymo dataset. ROAD++ follows the same principles as ROAD, in particular being a multi-label, multi-instance dataset.
The dataset was designed according to the following principles:
- A multi-label benchmark: each road event is composed of the label of the (moving) agent responsible, the label(s) of the type of action(s) being performed, and labels describing where the action is located. Each event can be assigned multiple instances of the same label type whenever relevant (e.g., an RE can be an instance of both moving away and turning left).
- The labelling is done from the point of view of the AV (ego-vehicle): the final goal is for the autonomous vehicle to use this information to make the appropriate decisions. The meta-data is intended to contain all the information required to fully describe a road scenario: the set of labels associated with the current video frame should be sufficient to recreate the road situation in one's head with one's eyes closed (or, equivalently, sufficient for the AV to make a decision).
- ROAD-R allows one to validate multiple tasks associated with situation awareness for self-driving: (i) agent detection, (ii) action detection, and (iii) location detection.
- The requirements of ROAD-R provide the ground for developing safer autonomous vehicles.
Main features
- Agent type labels, e.g., Pedestrian, Car, Cyclist, Large-Vehicle, Emergency-Vehicle, etc.
- Action annotations for humans as well as other road agents, e.g., Turning-right, Moving-away, etc.
- Semantic location labels for agents, e.g., Vehicle-lane, Right-pavement, etc.
- ~200K annotated frames from 1k videos, each video 20 seconds long on average.
- A track/tube ID annotated for every bounding box in every frame, for every agent in the scene.
- ~55K tubes/tracks in total.
- ~4,685K bounding-box-level labels.
Download
To download the ROAD-R dataset, please follow the instructions below:
1. Clone the ROAD dataset GitHub repo.
2. cd into the road directory.
3. Download the train and val videos and their agent, action, and location annotations by running bash get_dataset.sh.
4. Extract the frames from the downloaded videos by running python extract_videos2jpgs.py <path-to-road-folder>/road/.
5. Download the annotated requirements from this repository, which contains the requirements in two formats:
   a. requirements_dimacs.txt contains the requirements in DIMACS format, where each label is represented by a number (see the parsing sketch after these instructions).
   b. requirements_readable.txt contains the requirements in a human-readable format.
The natural language explanation of each requirement can be found in the appendix of the ROAD-R paper.
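For illustration, here is a minimal sketch of reading requirements_dimacs.txt, assuming the standard DIMACS CNF conventions (comment lines starting with 'c', a 'p' problem header, and each clause given as signed label numbers terminated by 0); verify these assumptions against the actual file before relying on the code.

```python
# A minimal sketch of loading the requirements, assuming standard DIMACS
# CNF conventions. Each returned clause is a list of signed label numbers,
# e.g. [-5, -7] would encode "not label 5 or not label 7".

def load_requirements(path):
    clauses = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(("c", "p")):
                continue  # skip comments and the problem header
            literals = [int(tok) for tok in line.split()]
            if literals and literals[-1] == 0:
                literals = literals[:-1]  # drop the trailing 0 terminator
            clauses.append(literals)
    return clauses

requirements = load_requirements("requirements_dimacs.txt")
print(len(requirements), "requirements loaded")
```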
The road directory should look like the following:
road-r
├── road_trainval_v1.0.json
├── requirements_dimacs.txt
├── requirements_readable.txt
├── videos
│   ├── 2014-06-25-16-45-34_stereo_centre_02.mp4
│   ├── 2014-06-26-09-53-12_stereo_centre_02.mp4
│   └── ...
└── rgb-images
    ├── 2014-06-25-16-45-34_stereo_centre_02
    │   ├── 00001.jpg
    │   ├── 00002.jpg
    │   └── ...
    ├── 2014-06-26-09-53-12_stereo_centre_02
    │   ├── 00001.jpg
    │   ├── 00002.jpg
    │   └── ...
    └── ...
The videos of the test set will be released in accordance with the schedule of our challenge.
Annotation structure
The annotations for the train and validation splits are saved in a single JSON file named road_trainval_v1.0.json, located in the root directory of the dataset, as shown above.
The first level of road_trainval_v1.0.json contains dataset-level information, such as the classes of each label type.
Here are all the fields: dict_keys(['all_input_labels', 'all_av_action_labels', 'av_action_labels', 'agent_labels', 'action_labels', 'duplex_labels', 'triplet_labels', 'loc_labels', 'db', 'label_types', 'all_duplex_labels', 'all_triplet_labels', 'all_agent_labels', 'all_loc_labels', 'all_action_labels', 'duplex_childs', 'triplet_childs']).
all_input_labels: The list of all classes used to annotate the dataset.
label_types: The list of all label types: ['agent', 'action', 'loc', 'duplex', 'triplet'].
all_av_action_labels: The list of all classes used to annotate AV actions.
av_action_labels: The list of the used AV actions.
The remaining fields ending in labels follow the same logic as the AV-action fields described above.
duplex_childs and triplet_childs contain the IDs of the child classes (from the agent, action, or location labels) used to construct the duplex and triplet labels.
A duplex is constructed from an agent class and an action class.
An event (or triplet) is constructed from an agent, an action, and a location class.
For this challenge, we use the labels contained in the lists 'agent_labels', 'action_labels', and 'loc_labels'.
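As an illustration, the following minimal sketch loads the file and inspects these dataset-level fields; it assumes the file sits under road-r/ as in the layout above.

```python
import json

# A minimal sketch of inspecting the top-level structure of the annotation
# file; the path assumes the directory layout shown earlier.
with open("road-r/road_trainval_v1.0.json") as f:
    dataset = json.load(f)

print(dataset["label_types"])    # ['agent', 'action', 'loc', 'duplex', 'triplet']
print(dataset["agent_labels"])   # the agent classes used for this challenge
print(len(dataset["db"]))        # number of annotated videos
```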
Finally, the db field contains all frame- and tube-level annotations for all the videos:
To access the annotations for a video, use db['2014-06-25-16-45-34_stereo_centre_02'], where '2014-06-25-16-45-34_stereo_centre_02' is the name of a video.
Each video annotation comes with the following fields: ['split_ids', 'agent_tubes', 'action_tubes', 'loc_tubes', 'duplex_tubes', 'triplet_tubes', 'av_action_tubes', 'frame_labels', 'frames', 'numf'].
Each field contains the following:
split_ids contains the split ID assigned to this video, out of 'test', 'train_1', 'val_1', ..., 'val_3'.
numf is the number of frames in the video.
frame_labels is a list of length numf containing the AV-action class ID assigned to each frame of the video.
frames is a dictionary containing the frame-level annotations: for each frame of the video it contains the fields ['annotated', 'rgb_image_id', 'width', 'height', 'av_action_ids', 'annos', 'input_image_id'], where:
annotated is a flag indicating whether the frame is annotated or not.
rgb_image_id (equal to input_image_id) is the ID of the physical frame extracted by ffmpeg, ranging from 1 to numf.
av_action_ids contains the AV-action label(s) of the frame.
annos contains the annotations of the frame, i.e., all its bounding boxes along with their associated labels; each bounding box has a unique ID. For example:
"annos": {"b19309": { "box": [0.34245960502692996,
0.423444976076555,
0.3631059245960503,
0.5179425837320574],
"agent_ids": [0],
"loc_ids": [6],
"action_ids": [4],
"duplex_ids": [1],
"triplet_ids": [18],
"tube_uid": "bbef3659"
}
"b433085": { "box": [0.5741350906095553,
0.44216691068814057,
0.58974519408777,
0.5230057739861901],
"agent_ids": [1],
"loc_ids": [0],
"action_ids": [9, 12],
"duplex_ids": [26, 31],
"triplet_ids": [315, 329],
"tube_uid": "51522791"
}
}
The previous annotation belongs to a frame that contains two bounding boxes (i.e., two agents). The first bounding box (with ID "b19309") belongs to a Pedestrian (since the agent_id is 0) Moving towards the ego-vehicle on the Left Pavement, while the second ("b433085") belongs to a Car that is Turning Left and Indicating Left (hence the two action IDs) on the Vehicle Lane.
box is the bounding box in the two-point format [xmin, ymin, xmax, ymax]; the coordinates are normalised between 0 and 1, so they are independent of the frame's dimensions.
The fields ending in _ids describe the five types of labels to which the current bounding box belongs. You will be evaluated only on the fields "agent_ids", "action_ids" and "loc_ids"; thus, you can ignore the duplex and triplet annotations.
tube_uid is the ID of the tube to which the current agent belongs (this connects the same agent appearing in multiple frames through time).
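Putting the above together, here is a minimal sketch of decoding the frame-level annotations into human-readable labels. It continues from the earlier snippet; the frame key '4585' is illustrative, and indexing the *_labels lists by the *_ids values is an assumption to verify against the baseline code.

```python
# A minimal sketch of decoding frame-level annotations; assumes `dataset`
# from the earlier snippet. The frame key is hypothetical.
video = dataset["db"]["2014-06-25-16-45-34_stereo_centre_02"]
print(video["split_ids"], video["numf"])  # split assignment and frame count

frame = video["frames"]["4585"]
if frame["annotated"]:
    for box_id, anno in frame["annos"].items():
        # Assumption: *_ids index directly into the corresponding label lists.
        agents = [dataset["agent_labels"][i] for i in anno["agent_ids"]]
        actions = [dataset["action_labels"][i] for i in anno["action_ids"]]
        locs = [dataset["loc_labels"][i] for i in anno["loc_ids"]]
        xmin, ymin, xmax, ymax = anno["box"]  # normalised [0, 1] coordinates
        print(box_id, agents, actions, locs, (xmin, ymin, xmax, ymax))
```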
The fields ending in _tubes contain the tube-level annotations of the corresponding label type.
For example, db['2014-06-25-16-45-34_stereo_centre_02']['agent_tubes'] contains tubes keyed by IDs like ['544e13cc-001-01', 'a074d1bf-001-01', 'e97b3e4c-001-01', 'edb6d66a-005-01', ...].
Each tube has the fields dict_keys(['label_id', 'annos']):
label_id is the class ID from the respective label type.
annos is a dictionary whose keys are frame IDs, e.g., ['agent_tubes']['10284a58-002-01']['annos'].keys() >> dict_keys(['4585', '4586', ..., '4629', '4630']).
annos['4585'] = 'b19111' stores a unique key that points to the frame-level annotation of this tube in frame number '4585'.
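As a final illustration, this minimal sketch follows an agent tube back to its per-frame bounding boxes, continuing from the snippets above; the tube chosen is simply the first one in the dictionary.

```python
# A minimal sketch of traversing an agent tube; assumes `dataset` and
# `video` from the earlier snippets.
tube_id, tube = next(iter(video["agent_tubes"].items()))

print(dataset["agent_labels"][tube["label_id"]])  # the class of this tube
for frame_id, box_id in tube["annos"].items():
    # box_id points to this tube's entry in that frame's annotations
    box = video["frames"][frame_id]["annos"][box_id]["box"]
    print(frame_id, box_id, box)
```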