RGB-Thermal (RGB-T) object tracking has received increasing attention because thermal information strongly complements visible data. However, the related research is limited by the lack of a comprehensive evaluation platform. In this paper, we contribute a video benchmark dataset for RGB-T tracking. It has three major advantages over existing ones: 1) Its size is sufficiently large for large-scale performance evaluation (total number of frames: 233.8K, maximum frames per sequence: 8K). 2) The alignment between RGB-T sequence pairs is highly accurate and requires no pre- or post-processing. 3) The occlusion levels are annotated for analyzing the occlusion-sensitive performance of different tracking algorithms. Moreover, we propose a novel graph-based approach to learn a robust object representation for RGB-T tracking. In particular, the tracked object is represented as a graph with image patches as nodes. This graph is dynamically learned in a single unified optimization framework from two aspects. First, the graph affinity is optimized based on the weighted sparse representation, in which a modality weight is introduced to leverage RGB and thermal information adaptively. Second, the weight of each graph node (i.e., image patch) is propagated from the initial weights along the graph affinity. The optimized patch weights are then imposed on the extracted RGB and thermal features, and the target object is finally located using the structured SVM algorithm. Extensive experiments on both public and newly created datasets demonstrate the effectiveness of the proposed tracker against several state-of-the-art tracking methods.
The full benchmark contains 234 RGB-T video sequence pairs.
Attr Description
NO No Occlusion - the target is not occluded.
PO Partial Occlusion - the target object is partially occluded.
HO Heavy Occlusion - the target object is heavily occluded (over 80%).
LI Low Illumination - the illumination in the target region is low.
LR Low Resolution - the resolution in the target region is low.
TC Thermal Crossover - the target has a temperature similar to that of other objects or the surrounding background.
DEF Deformation - non-rigid object deformation.
FM Fast Motion - the motion of the ground truth between two adjacent frames is larger than 20 pixels.
SV Scale Variation - the ratio of the first bounding box to the current bounding box is outside the range [0.5, 1].
MB Motion Blur - the motion of the target object blurs the image.
CM Camera Moving - the target object is captured by a moving camera.
BC Background Clutter - the background around the target object is cluttered.
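The FM and SV definitions above are thresholds on the ground-truth boxes, so they can be checked mechanically. The following is a minimal sketch, not the benchmark's own annotation tool. It assumes boxes are given as (x, y, w, h) tuples, that "motion" means displacement of the box center, and that the SV "ratio" compares box areas; none of these conventions is stated on this page, and the function names are hypothetical.

```python
import math

def center(box):
    """Center of an (x, y, w, h) box. The box layout is an assumption."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def fast_motion_frames(boxes, threshold=20.0):
    """Indices of frames whose center moved more than `threshold` pixels
    relative to the previous frame (assumed reading of the FM attribute)."""
    flagged = []
    for i in range(1, len(boxes)):
        cx0, cy0 = center(boxes[i - 1])
        cx1, cy1 = center(boxes[i])
        if math.hypot(cx1 - cx0, cy1 - cy0) > threshold:
            flagged.append(i)
    return flagged

def scale_variation_frames(boxes, low=0.5, high=1.0):
    """Indices of frames where first_area / current_area falls outside
    [low, high]. Using areas for the ratio is an assumption."""
    _, _, w0, h0 = boxes[0]
    first_area = w0 * h0
    flagged = []
    for i, (_, _, w, h) in enumerate(boxes):
        area = w * h
        if area <= 0:
            continue  # skip degenerate boxes rather than divide by zero
        ratio = first_area / area
        if ratio < low or ratio > high:
            flagged.append(i)
    return flagged

# Toy ground-truth track, made up for illustration only.
boxes = [(0, 0, 10, 10), (5, 0, 10, 10), (40, 0, 10, 10),
         (40, 0, 20, 20), (40, 0, 5, 5)]
fm = fast_motion_frames(boxes)
sv = scale_variation_frames(boxes)
```

In this toy track, only the jump from x = 5 to x = 40 exceeds the 20-pixel FM threshold, and the last two boxes fall outside the SV range because their areas differ from the first box by a factor of four.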
The RGBT234 dataset can be downloaded through the link: Download Dataset. For those who cannot access Google cloud disk, we also share the dataset on Baidu cloud disk. The download link is https://pan.baidu.com/s/1naq87OmHz2c_GrtOdFCpgQ.
The Accuracy (A), Robustness (R), and Expected Average Overlap (EAO) of the evaluated trackers.
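All three measures are built on the per-frame overlap between the predicted and ground-truth boxes. The sketch below is a simplified illustration under stated assumptions, not the evaluation toolkit used for the reported numbers: boxes are assumed to be (x, y, w, h) tuples, accuracy is taken as the mean overlap over frames with non-zero overlap, and zero-overlap frames are counted as a rough stand-in for failures. A full protocol would typically re-initialize the tracker after each failure, and EAO is not computed here. The function names are hypothetical.

```python
def overlap(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0  # boxes do not intersect
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union

def accuracy_and_failures(pred, gt):
    """Mean overlap over frames with non-zero overlap, plus the number of
    zero-overlap frames. This is a simplification: no re-initialization
    and no burn-in period after a failure."""
    overlaps = [overlap(p, g) for p, g in zip(pred, gt)]
    tracked = [o for o in overlaps if o > 0]
    failures = sum(1 for o in overlaps if o == 0)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

# Toy example, made up for illustration only.
gt = [(0, 0, 10, 10)] * 3
pred = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
acc, fails = accuracy_and_failures(pred, gt)
```

Here the first prediction matches exactly, the second overlaps by one third, and the third misses the target entirely, so the sketch reports a mean overlap of 2/3 over the two tracked frames and one failure.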