Towards More Flexible and Accurate Object Tracking with Natural Language:

Algorithms and Benchmark

Xiao Wang#1, Xiujun Shu#1,2, Zhipeng Zhang#3, Bo Jiang#4, Yaowei Wang#1, Yonghong Tian#1,5, Feng Wu#1,6

#1 Peng Cheng Laboratory, Shenzhen, China; #2 School of Electronic and Computer Engineering, Peking University, Shenzhen, China

#3 NLPR, Institute of Automation, Chinese Academy of Sciences; #4 School of Computer Science and Technology, Anhui University, Hefei, China;

#5 Department of Computer Science and Technology, Peking University, Beijing, China; #6 University of Science and Technology of China, Hefei, China

[Paper] [Demo Video] [Slides] [Dataset] [Evaluation Toolkit] [SOT-paper-list]

Tracking by Natural Language (TNL2K) is a benchmark constructed for the evaluation of tracking by natural language specification. TNL2K features:

  • Large-scale: 2,000 sequences with 1,244,340 frames in total and a 663-word vocabulary, split into 1,300 / 700 videos for training / testing, respectively

  • High-quality: Manual annotation of each frame, with careful inspection

  • Multi-modal: Providing visual and language annotation for each sequence

  • Adversarial-samples: Adversarial samples are randomly inserted to support research on adversarial attack and defense

  • Significant-appearance-variation: Contains videos with clothing/face changes of pedestrians

  • Heterogeneous: Contains RGB, thermal, cartoon, and synthetic data

  • Multiple-baseline: Tracking-by-BBox, Tracking-by-Language, Tracking-by-Joint-BBox-Language

How to Download TNL2K dataset?

  • Download from BaiduYun: Link: Password: pclt (Note: you need to copy this link and paste it into your browser yourself; it will not work if you click it directly, even as a hyperlink)

  • Download from Onedrive: Click [here]

  • Download from GoogleDrive: Click [here]
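Once downloaded, each sequence pairs video frames with per-frame bounding boxes and one language description. The sketch below shows how such a sequence might be loaded; the file layout assumed here (an `imgs/` folder, a `groundtruth.txt` with one `x,y,w,h` box per line, and a `language.txt` with the sentence) is an illustrative assumption, so please check the released dataset and toolkit for the actual structure.

```python
import os

def load_tnl2k_sequence(seq_dir):
    """Load one sequence: sorted frame paths, per-frame (x, y, w, h) boxes,
    and the natural language description.

    NOTE: the directory layout (imgs/, groundtruth.txt, language.txt) is a
    hypothetical example, not the confirmed TNL2K release format.
    """
    frame_dir = os.path.join(seq_dir, "imgs")
    frames = sorted(
        os.path.join(frame_dir, f)
        for f in os.listdir(frame_dir)
        if f.lower().endswith((".jpg", ".png"))
    )
    boxes = []
    with open(os.path.join(seq_dir, "groundtruth.txt")) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            x, y, w, h = map(float, line.replace(",", " ").split())
            boxes.append((x, y, w, h))
    with open(os.path.join(seq_dir, "language.txt")) as f:
        sentence = f.read().strip()
    return frames, boxes, sentence
```

A tracker would then be initialized from `sentence` (tracking-by-language), from `boxes[0]` (tracking-by-BBox), or from both.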

[Note] This dataset can only be used for research purposes. For tutorials on the evaluation toolkit, please refer to this GitHub: The annotations of 12 videos in the training subset have been revised for better accuracy; please update these videos with the [new annotations]. The code of this work will be released on GitHub after we finish the journal extension. If you have any questions about this work, please feel free to contact us.
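The evaluation toolkit reports the standard one-pass-evaluation metrics for tracking benchmarks like this one. For intuition, here is a minimal sketch of the IoU-based success (AUC) score; this is a simplified illustration of the common protocol, not the toolkit's own code.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    x1, y1 = max(xa, xb), max(ya, yb)
    x2, y2 = min(xa + wa, xb + wb), min(ya + ha, yb + hb)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Area under the success curve: the mean, over IoU thresholds in
    [0, 1], of the fraction of frames whose overlap exceeds the threshold."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return float(np.mean([(ious > t).mean() for t in thresholds]))
```

Feeding a tracker's predicted boxes and the ground-truth boxes of each sequence through `success_auc`, then averaging over sequences, gives the overall success score reported on the benchmark.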

Motivation and Video Samples:

Tracking by natural language specification is an emerging research topic that aims to locate the target object in a video sequence based on its language description. Compared with traditional bounding-box (BBox) based tracking, this setting guides object tracking with high-level semantic information, resolves the ambiguity of the BBox, and links local and global search together organically. These benefits may bring more flexible, robust, and accurate tracking performance in practical scenarios. However, existing natural-language-initialized trackers are developed and compared on benchmark datasets proposed for tracking-by-BBox, which cannot reflect the true power of tracking-by-language. In this work, we propose a new benchmark specifically dedicated to tracking-by-language, including a large-scale dataset and strong, diverse baseline methods. Specifically, we collect 2,000 video sequences (containing 1,244,340 frames in total and a 663-word vocabulary) and split them into 1,300/700 for training/testing, respectively. We densely annotate one English sentence and the corresponding bounding boxes of the target object for each video. We also introduce two new challenges into TNL2K for the object tracking task, i.e., adversarial samples and modality switch. A strong baseline method based on an adaptive local-global-search scheme is proposed for future works to compare against. We believe this benchmark will greatly boost related research on natural-language-guided tracking.

Statistical Analysis:

Our Approach:

More Experimental Results:


If you find this work useful for your research, please cite this paper:


@InProceedings{Wang_2021_CVPR,
    author    = {Wang, Xiao and Shu, Xiujun and Zhang, Zhipeng and Jiang, Bo and Wang, Yaowei and Tian, Yonghong and Wu, Feng},
    title     = {Towards More Flexible and Accurate Object Tracking With Natural Language: Algorithms and Benchmark},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {13763-13773}
}