ROAD++ is an extension of our previously released ROAD dataset, with an even larger number of multi-label annotated videos drawn from the Waymo dataset, ROAD-UAE (UAE), and TACO (CARLA simulation). These videos span an even wider range of conditions, across different cities in the United States. Since ROAD++ encompasses videos from both the United Kingdom and the United States, it can be used as a benchmark not only for action detection models but also for domain adaptation models. In the future, we plan to extend it further to new cities, countries, and sensor configurations, with the long-term goal of creating an even more robust, “in the wild” setting.
The ROAD dataset was built upon the Oxford RobotCar dataset. Please cite the original dataset if it is useful in your work; the citation can be found here. ROAD is released with a paper and with 3D-RetinaNet code as a baseline, which also contains the evaluation code.
ROAD++ (the new version itself) is of significant size: 1000 videos are labeled, for a total of ∼4.6M detection bounding boxes, in turn associated with ∼12.4M individual labels, broken down into 3.9M agent labels, 4.3M action labels, and 4.2M location labels. ROAD++ follows the same principles as ROAD, in particular its multi-label, multi-instance design.
ROAD++ is the result of annotating 1000 carefully selected, relatively long-duration (20 seconds each) videos from the Waymo dataset in terms of what we call road events (REs), as seen from the point of view of the autonomous vehicle capturing the video. REs are defined as triplets E = (Ag, Ac, Loc) composed of a moving agent Ag, the action Ac it performs, and the location Loc in which this takes place. Agents, actions, and locations are all classes in a finite list compiled by surveying the content of the videos. Road events are represented as ’tubes’, i.e., time series of frame-wise bounding box detections. The annotations include:
Action annotations for humans as well as other road agents, e.g. Turning-right, Moving-away etc.
Agent type labels, e.g. Pedestrian, Car, Cyclist, Large-Vehicle, Emergency-Vehicle etc.
Semantic location labels for each agent, e.g. in vehicle lane, in right pavement etc.
198K frames from 1000 annotated videos, each video 20 seconds long on average.
Track/tube IDs annotated for every bounding box, on every frame, for every agent in the scene.
54K tubes/tracks of individual agents.
3.9M bounding box-level agent labels.
4.3M and 4.2M bounding box-level action and location labels.
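For illustration, a single agent tube together with its agent, action, and location labels can be represented roughly as in the sketch below. This is only an illustration: the field names are hypothetical, and the exact schema is defined by the released annotation files.

```python
# Minimal sketch of a ROAD++ road event "tube": a tracked agent with multi-label
# agent/action/location annotations and per-frame bounding boxes.
# Field names are illustrative, not the exact keys of the released annotation files.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class RoadEventTube:
    tube_id: int                      # track/tube id, unique per agent
    agent_labels: List[str]           # e.g. ["Car"]
    action_labels: List[str]          # e.g. ["Turning-right", "Moving-away"]
    location_labels: List[str]        # e.g. ["In-vehicle-lane"]
    # frame index -> bounding box (x1, y1, x2, y2) in pixel coordinates
    boxes: Dict[int, Tuple[float, float, float, float]] = field(default_factory=dict)

# A road event E = (Ag, Ac, Loc) corresponds to the label triplet attached to one tube:
tube = RoadEventTube(
    tube_id=42,
    agent_labels=["Car"],
    action_labels=["Turning-right"],
    location_labels=["In-vehicle-lane"],
    boxes={0: (100.0, 200.0, 180.0, 260.0), 1: (104.0, 201.0, 184.0, 262.0)},
)
```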
The TACO dataset is designed for the atomic activity recognition task. This task offers an expressive way to ground the road topology in road users' actions and greatly reduces annotation cost compared to frame-wise action detection labeling. To overcome the long-tail distribution of the real world, we collect TACO in the CARLA simulator, which enables more efficient collection of a diverse, balanced, and large-scale dataset.
Annotation of atomic activity: given a short clip, we label a road user's action by describing its movement across the different regions of the intersection. An atomic activity class is formulated as:
(region_start -> region_end: agent_types)
where the regions denote the four roadways (Z1, Z2, Z3, Z4) and four corners (C1, C2, C3, C4) of an intersection, and agent_type is one of six types: four-wheeler (C), two-wheeler (K), pedestrian (P), grouped four-wheelers (C+), grouped two-wheelers (K+), and grouped pedestrians (P+).
Combining movements over the defined road topology with the agent types yields 64 atomic activity classes in total, so the task can be formulated as multi-label action recognition.
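For illustration, the sketch below builds and parses label strings of this form; the exact string format used in the released annotations may differ.

```python
# Minimal sketch of building and parsing TACO-style atomic activity labels of the
# form "(region_start -> region_end: agent_type)". Illustrative only; the exact
# format in the released annotations may differ.
import re

REGIONS = ["Z1", "Z2", "Z3", "Z4", "C1", "C2", "C3", "C4"]   # roadways and corners
AGENT_TYPES = ["C", "K", "P", "C+", "K+", "P+"]              # see definitions above

def make_label(start: str, end: str, agent_type: str) -> str:
    assert start in REGIONS and end in REGIONS and agent_type in AGENT_TYPES
    return f"({start} -> {end}: {agent_type})"

def parse_label(label: str):
    m = re.fullmatch(r"\((\w\d) -> (\w\d): (\S+)\)", label)
    if m is None:
        raise ValueError(f"unrecognised atomic activity label: {label}")
    return m.group(1), m.group(2), m.group(3)

print(make_label("Z1", "Z3", "C+"))   # (Z1 -> Z3: C+)
print(parse_label("(C1 -> C2: P)"))   # ('C1', 'C2', 'P')
```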
To this end, we collect the largest dataset for atomic activity recognition, with 5178 clips and 16521 atomic activity labels. The balanced distribution also enables a more comprehensive analysis of models across diverse scenarios.
Images are collected at a resolution of 512x1536. A clip contains 109.3 frames on average, captured at 20 Hz.
Dataset usage
For usage, images are downsampled to 256x768.
For the training set, we augment the data by randomly sampling 16 frames from each clip.
For the validation and test sets, we use a fixed sampling strategy to ensure the same frames are used every time.
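A minimal sketch of this sampling scheme is shown below; the actual implementation in the TACO codebase may differ in detail.

```python
# Sketch of the frame sampling described above: random 16-frame sampling for
# training, and deterministic (evenly spaced) sampling for validation/test.
import random
import numpy as np

NUM_FRAMES = 16

def sample_train_indices(clip_len: int, num_frames: int = NUM_FRAMES) -> list:
    # Randomly sample `num_frames` sorted frame indices as temporal augmentation.
    return sorted(random.sample(range(clip_len), k=min(num_frames, clip_len)))

def sample_eval_indices(clip_len: int, num_frames: int = NUM_FRAMES) -> list:
    # Fixed, evenly spaced indices so validation/test always see the same frames.
    return np.linspace(0, clip_len - 1, num=num_frames).round().astype(int).tolist()

print(sample_train_indices(109))   # varies run to run
print(sample_eval_indices(109))    # always the same
```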
Please refer to the TACO dataset for more details.
TACO is the largest dataset for atomic activity recognition.
The dataset is collected in the CARLA simulator, enabling diverse and balanced atomic activity class distribution.
64 classes of atomic activity, obtained by combining 12 vehicle actions and 8 pedestrian actions with the agent types: car (C), motorcyclist/bicyclist (K), pedestrian (P), group of cars (C+), group of motorcyclists/bicyclists (K+), and group of pedestrians (P+).
Scenarios are collected in either a 4-way intersection or a T-intersection.
5178 clips with 16521 activity labels.
The average video length is 109.3 frames, at a frame rate of 20 Hz.
We collect image and instance segmentation data with a size of 512 × 1536 pixels, and subsequently downsample each frame to 256 × 768 as input to the models.
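Since a clip can contain several atomic activities at once, targets for the multi-label recognition task are typically encoded as multi-hot vectors over the 64 classes. Below is a minimal sketch, assuming a hypothetical list all_classes holding the 64 class strings (a toy 3-class vocabulary is used in the example).

```python
# Sketch of multi-hot target encoding for multi-label atomic activity recognition.
# `all_classes` stands in for the dataset's fixed list of 64 class strings
# (not reproduced here); a toy 3-class vocabulary is used for the example.
def encode_targets(clip_labels, all_classes):
    """Map a clip's list of atomic activity labels to a multi-hot target vector."""
    return [1.0 if c in clip_labels else 0.0 for c in all_classes]

toy_classes = ["(Z1 -> Z2: C)", "(Z1 -> Z3: C)", "(C1 -> C2: P)"]
print(encode_targets(["(Z1 -> Z3: C)", "(C1 -> C2: P)"], toy_classes))
# [0.0, 1.0, 1.0]
```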