ROAD++ is an extension of our previously released ROAD dataset with an even larger number of multi-label annotated videos from the Waymo dataset. These videos span an even wider range of conditions, across different cities in the United States. Given that ROAD++ encompasses videos from both the United Kingdom and the United States, it can be used as a benchmark not only for action detection models, but also for domain adaptation models. In the future, we plan to further extend it to new cities, countries and sensor configurations, with the long-term goal of creating an even more robust, “in the wild” setting.
The ROAD dataset was built upon the Oxford RobotCar Dataset. Please cite the original dataset if it is useful in your work; the citation can be found here. ROAD++ is released with a paper and the 3D-RetinaNet code as a baseline, which also contains the evaluation code.
ROAD++ (the new version itself) is of significant size: 1000 videos are labelled for a total of ~4.6M detection bounding boxes, in turn associated with 14M unique individual labels, broken down into 3.9M agent labels, 4.3M action labels, and 4.2M location labels. ROAD++ also follows the same principles as ROAD, in particular being a multi-label, multi-instance dataset.
ROAD++ is the result of annotating 1000 carefully selected, relatively long-duration (20 seconds each) videos from the Waymo dataset in terms of what we call road events (REs), as seen from the point of view of the autonomous vehicle capturing the video. REs are defined as triplets E = (Ag, Ac, Loc), composed of a moving agent Ag, the action Ac it performs, and the location Loc in which this takes place. Agent, action and location are all classes in a finite list compiled by surveying the content of the videos. Road events are represented as ‘tubes’, i.e., time series of frame-wise bounding box detections (a minimal sketch of one possible tube record is shown after the list below). The annotations include:

- Action annotations for humans as well as other road agents, e.g. Turning-right, Moving-away etc.
- Agent type labels, e.g. Pedestrian, Car, Cyclist, Large-Vehicle, Emergency-Vehicle etc.
- Semantic location labels describing the location of each agent, e.g. in vehicle lane, in right pavement etc.
- 198K annotated frames from 1000 videos, each video 20 seconds long on average.
- A track/tube id annotated for every bounding box on every frame for every agent in the scene.
- 54K tubes/tracks of individual agents.
- 3.9M bounding box-level agent labels.
- 4.3M and 4.2M bounding box-level action and location labels, respectively.
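
To make the tube/triplet structure concrete, here is a minimal Python sketch of how a single road-event tube could be represented and queried. The class and field names (`EventTube`, `tube_id`, `agent`, `actions`, `locations`, `boxes`) are illustrative assumptions and do not reflect the released annotation schema or the 3D-RetinaNet loading code.

```python
# Hypothetical representation of one ROAD++-style road-event tube.
# Field names are illustrative only, not the released annotation format.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class EventTube:
    """One road event E = (Agent, Action, Location) tracked over time."""
    tube_id: int                 # unique id for this agent's track
    agent: str                   # e.g. "Pedestrian", "Car", "Cyclist"
    actions: List[str]           # multi-label, e.g. ["Turning-right"]
    locations: List[str]         # multi-label, e.g. ["In-vehicle-lane"]
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box


def frame_level_labels(tube: EventTube, frame_idx: int) -> Optional[tuple]:
    """Return (box, agent, actions, locations) active at a given frame, if any."""
    box = tube.boxes.get(frame_idx)
    if box is None:
        return None  # the tube does not cover this frame
    return box, tube.agent, tube.actions, tube.locations


# Example usage with made-up values
tube = EventTube(
    tube_id=17,
    agent="Cyclist",
    actions=["Turning-right"],
    locations=["In-vehicle-lane"],
    boxes={0: (120.0, 200.0, 180.0, 320.0), 1: (125.0, 198.0, 186.0, 322.0)},
)
print(frame_level_labels(tube, 1))
```

Because agents can carry several action and location labels at once, the per-frame labels are kept as lists rather than single values, mirroring the multi-label, multi-instance nature of the dataset.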