Learning Target-aware Attention for Robust Tracking with Conditional Adversarial Network

Xiao Wang, Tao Sun, Rui Yang, Bin Luo

School of Computer Science and Technology, Anhui University, Hefei 230601, China

Abstract:

Many current visual trackers are based on the tracking-by-detection framework, which searches for the target object within a local window in each frame. Although these trackers achieve appealing performance, their localization and scale handling often degrade in extremely challenging scenarios, such as heavy occlusion and large deformation, for two major reasons. i) They set a local search window using temporal context only, which may not cover the target at all and therefore causes tracking failure. ii) Some of them adopt an image pyramid strategy to handle scale variations, which heavily relies on accurate target localization and is thus easily disturbed when the localization is unreliable. To address these issues, this paper presents a novel joint local and scale-aware global search strategy that simultaneously achieves target localization and scale handling using learned target-driven attention maps, going beyond the popular tracking-by-detection framework. The attention maps are generated by a conditional generative adversarial network (CGAN); specifically, the generator of the CGAN is based on an encoder-decoder architecture. The encoder contains two branches that extract features from the target object and the current frame. The decoder then transforms the concatenated features into attention maps. Finally, we employ the attention maps to generate proposals with high-quality locations and scales, and perform object tracking via a multi-domain CNN. Our approach is efficient and effective, requires only a small amount of training data, and clearly improves on the tracking-by-detection framework. Extensive experiments show that the proposed approach outperforms most recent state-of-the-art trackers on several visual tracking benchmarks, and provides improved robustness to fast motion, scale variation, and heavy occlusion.
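The key step above, turning an attention map into proposals with both locations and scales, can be sketched in plain Python. This is an illustrative toy only (the paper generates the attention maps with a CGAN and scores proposals with a multi-domain CNN); the function name, threshold ratio, and top-k selection below are our own assumptions, not the paper's implementation:

```python
def proposals_from_attention(attn, top_k=3, thresh_ratio=0.5):
    """Toy sketch: derive box proposals from a 2D attention map.

    Locations come from the strongest attention cells anywhere in the
    frame (global search, not a local window around the last result);
    a shared scale is estimated from the spatial extent of the
    high-attention region, so box size does not rely on an image
    pyramid around a possibly wrong localization.
    """
    h, w = len(attn), len(attn[0])
    peak = max(max(row) for row in attn)
    # Cells whose response exceeds a fraction of the peak attention.
    hot = [(r, c) for r in range(h) for c in range(w)
           if attn[r][c] >= thresh_ratio * peak]
    # Scale estimate: bounding extent of the high-attention region.
    rows = [r for r, _ in hot]
    cols = [c for _, c in hot]
    box_h = max(rows) - min(rows) + 1
    box_w = max(cols) - min(cols) + 1
    # Locations: the top-k attention cells across the whole map.
    centers = sorted(((attn[r][c], r, c) for r, c in hot),
                     reverse=True)[:top_k]
    # Each proposal: (center_row, center_col, width, height).
    return [(r, c, box_w, box_h) for _, r, c in centers]
```

For example, an attention map with one bright 2x2 blob yields proposals centered on that blob, all sharing the 2x2 scale estimated from the blob's extent.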

Motivation:

  • How can we estimate accurate scale information of the target object?

  • How can we sample high-quality global proposals from the whole image, instead of local proposals around the previous tracking result only?

Network Architecture:

Visualization

Demo Video

Note: the red BBox is the ground truth; the blue BBox shows our results.

global-Attention_CarScale.avi
global-Attention_Jump.avi
global-Attention_MotorRolling.avi
global-Attention_Woman.avi


Experimental Results

If you find this paper useful for your research, please consider citing our paper:

@inproceedings{wang2019GANTrack,
  title={Learning Target-aware Attention for Robust Tracking with Conditional Adversarial Network},
  author={Wang, Xiao and Sun, Tao and Yang, Rui and Luo, Bin},
  booktitle={30th British Machine Vision Conference (BMVC)},
  year={2019}
}