With the advantage of high mobility, Unmanned Aerial Vehicles (UAVs) fuel numerous important applications in computer vision, delivering more efficiency and convenience than surveillance cameras with fixed camera angle, scale, and view. However, very few UAV datasets have been proposed, and they focus only on a specific task such as visual tracking or object detection in relatively constrained scenarios. Consequently, it is of great importance to develop an unconstrained UAV benchmark to boost related research. In this paper, we construct a new UAV benchmark focusing on complex scenarios with a new level of challenge. Selected from 10 hours of raw video, about 80,000 representative frames are fully annotated with bounding boxes as well as up to 14 kinds of attributes (e.g., weather condition, flying altitude, camera view, vehicle category, and occlusion) for three fundamental computer vision tasks: object detection, single object tracking, and multiple object tracking. Then, a detailed quantitative study is performed using the most recent state-of-the-art algorithms for each task. Experimental results show that current state-of-the-art methods perform relatively worse on our dataset, due to the new challenges that appear in UAV-based real scenes, e.g., high density, small objects, and camera motion. To our knowledge, our work is the first to explore such issues in unconstrained scenes comprehensively.
Figure 1. Examples of annotated frames in the UAVDT benchmark. The three rows correspond to object DETection (DET), Multiple Object Tracking (MOT), and Single Object Tracking (SOT), respectively. The shooting conditions of the UAV are shown in the lower right corner. The pink areas are ignored regions in the dataset. Different bounding box colors denote different classes of vehicles. For clarity, only some attributes are displayed.
The proposed UAVDT benchmark consists of 10 hours of raw videos, from which 100 video sequences of about 80,000 representative frames are selected. The sequences contain between 83 and 2970 frames each. The videos are captured by a UAV platform at various urban locations such as squares, arterial streets, toll stations, highways, crossings, and T-junctions. The video sequences are recorded at 30 frames per second (fps) with a resolution of 1080×540 pixels.
We annotated about 80,000 frames from the 10 hours of raw video, yielding about 0.84 million bounding boxes over 2,700 vehicles. In some areas, the vehicles were too small to analyze their motion or to classify them. For the tracking tasks, i.e., MOT and SOT, we define 3 attributes indicating conditions of the scene (Illumination Condition, Flying Altitude, and Camera View) and 1 attribute indicating the duration of the sequences (Duration).
Illumination Condition indicates weather conditions (e.g., fog and rain) and the time of day (e.g., daylight and night). The videos captured in daylight introduce shadow interference caused by sunshine. In night scenes, under the dim illumination of street lights, objects' texture information can hardly be captured. The scenes captured in foggy conditions lack details of the objects' contours.
Flying Altitude indicates the flying height of the UAV when a video is captured, i.e., low-alt (< 30m), medium-alt (30m–70m), and high-alt (> 70m), which affects the scale of the objects. At low altitude, much more object detail is captured, while each object occupies a larger area of the scene, so scenes typically contain far fewer objects than at higher altitudes. At high altitude, the size of vehicles is much smaller; for example, a scene can contain more than a hundred tiny objects, each occupying only about 0.005% of the pixels in a frame.
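The altitude bins above can be expressed as a small helper (a sketch for illustration only; the function name is ours, and the source does not specify which bin an exact boundary value such as 30m falls into, so we assign boundaries to the lower bin):

```python
def altitude_label(height_m: float) -> str:
    """Map a UAV flying height in meters to the UAVDT Flying Altitude attribute.
    Bins follow the paper: low-alt (< 30m), medium-alt (30m-70m), high-alt (> 70m).
    Boundary handling (<= at 30m and 70m) is our assumption."""
    if height_m <= 30:
        return "low-alt"
    elif height_m <= 70:
        return "medium-alt"
    else:
        return "high-alt"
```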
Camera View indicates how a video was recorded, i.e., front-view, side-view, and bird-view. In front-view, the camera is aligned with the road. In side-view, the UAV camera is positioned off the axis of the road, capturing the side of the objects. In bird-view, the camera looks perpendicularly down at the ground. The camera view can change during a sequence, so both front-view and side-view may be present. Duration indicates the length of a sequence, allowing the robustness of methods to be evaluated on long sequences. Sequences containing more than 1500 frames are tagged with the long-term attribute.
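The duration rule can be stated as a one-line predicate (a sketch; the function name is illustrative, and the strict "more than 1500 frames" threshold follows the text):

```python
def is_long_term(num_frames: int) -> bool:
    """Tag a sequence as long-term if it contains more than 1500 frames,
    per the Duration attribute definition in UAVDT."""
    return num_frames > 1500
```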
For the detection task, we consider the 4 attributes above and label three other kinds of attributes for evaluating DET methods, i.e., Vehicle Category, Vehicle Occlusion, and Out-of-view. Vehicle Category indicates the kind of vehicle in the scene, e.g., car, truck, and bus. Vehicle Occlusion is the fraction of the bounding box occluded by other objects or by obstacles in the scene (e.g., under a bridge). We split occlusion into four levels, i.e., no-occ (NOC, 0%), small-occ (SOC, 1%–30%), medium-occ (MOC, 30%–70%), and large-occ (LOC, 70%–100%). Out-of-view is the degree to which vehicle parts lie outside the scene or in the ignored regions. We split it into three levels, i.e., no-out (NO, 0%), small-out (SO, 1%–30%), and medium-out (MO, 30%–50%). Objects with an out-of-view degree larger than 50% are discarded. The distribution of these 3 kinds of attributes in our dataset is shown in Figure 2.
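The two threshold schemes above can be sketched as simple binning functions (illustrative only; the function names are ours, and since the stated ranges overlap at their boundaries, e.g., 30% appears in both small-occ and medium-occ, we assume boundary values fall into the lower bin):

```python
from typing import Optional

def occlusion_level(occluded_fraction: float) -> str:
    """Map the occluded fraction of a bounding box to the UAVDT
    Vehicle Occlusion attribute (no-occ / small-occ / medium-occ / large-occ)."""
    if occluded_fraction == 0.0:
        return "no-occ"
    elif occluded_fraction <= 0.30:
        return "small-occ"
    elif occluded_fraction <= 0.70:
        return "medium-occ"
    else:
        return "large-occ"

def out_of_view_level(out_fraction: float) -> Optional[str]:
    """Map the out-of-view degree of a vehicle to the UAVDT Out-of-view
    attribute; objects more than 50% out of view are discarded (None),
    per the annotation protocol."""
    if out_fraction > 0.50:
        return None  # discarded from the dataset
    if out_fraction == 0.0:
        return "no-out"
    elif out_fraction <= 0.30:
        return "small-out"
    else:
        return "medium-out"
```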
Figure 2. The distribution of attributes in UAVDT.
This dataset is for research purpose only.
If you use the dataset, our results, or the source code, please cite our paper:
• D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, Q. Tian, "The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking", European Conference on Computer Vision (ECCV), 2018.