An Evaluation Perspective in Visual Object Tracking: from Task Design to Benchmark Construction and Algorithm Analysis
Tutorial at the 31st IEEE International Conference on Image Processing (ICIP)
The Visual Object Tracking (VOT) task, a foundational element in computer vision, seeks to emulate the dynamic vision system of humans and attain human-like object tracking proficiency in intricate environments. This task has been widely applied in practical scenarios such as autonomous driving, video surveillance, and robot vision. Over the past decade, the surge in deep learning has spurred various research groups to devise diverse tracking frameworks, advancing VOT research considerably. However, challenges persist in natural application scenes, where factors like target deformation, fast motion, and illumination changes pose obstacles for VOT trackers. Instances of suboptimal performance in real-world environments reveal a significant gap between the capabilities of state-of-the-art trackers and human expectations. This observation underscores the need to scrutinize and enhance the evaluation aspects of VOT research.
Therefore, in this tutorial, we aim to introduce the basic knowledge of dynamic visual tasks represented by VOT to the audience, starting from task definition and incorporating interdisciplinary research perspectives of evaluation techniques. The tutorial includes four parts: First, we discuss the evolution of task definition in research, which has transitioned from perceptual to cognitive intelligence. Second, we introduce the principal experimental environments utilized in VOT evaluations. Third, we present the executors responsible for executing VOT tasks, including tracking algorithms and interdisciplinary experiments involving human visual tracking. Finally, we introduce the evaluation mechanism and metrics, comprising traditional machine-machine comparisons and novel human-machine comparisons.
This tutorial aims to guide researchers in focusing on the emerging evaluation technique, improving their understanding of capability bottlenecks, facilitating a more thorough examination of disparities between current methods and human capabilities, and ultimately advancing towards the goal of algorithmic intelligence.
This tutorial focuses on techniques for visual object tracking (VOT) from an evaluation perspective. The tutorial is organized as follows:
We start by discussing the development direction of task definition, which encompasses the original short-term tracking, long-term tracking, and the recently proposed global instance tracking. As the VOT definition has evolved, research has shifted from perceptual to cognitive intelligence. Additionally, we will outline the challenging factors in the VOT task, aiming to assist researchers in comprehending the bottlenecks in actual applications.
Secondly, we will introduce representative experimental environments used in Visual Object Tracking (VOT) evaluations. In contrast to conventional reviews or tutorials that predominantly present datasets chronologically, this tutorial categorizes environments into three distinct groups: general datasets, dedicated datasets, and competition datasets. Each category is introduced separately, aiming to assist researchers in selecting appropriate datasets for subsequent experimental designs.
Thirdly, we will introduce the executors responsible for performing VOT tasks. These encompass not only tracking algorithms, such as traditional trackers, correlation filter (CF) based trackers, Siamese neural network (SNN) based trackers, and transformer-based trackers, but also experiments involving human visual tracking within interdisciplinary contexts. We posit that incorporating relevant studies on human dynamic visual abilities can enhance researchers' comprehension of VOT research within interdisciplinary frameworks. Furthermore, furnishing this information enables a comparative analysis between machines and humans, contributing to a more comprehensive evaluation of visual intelligence and a nuanced understanding of current algorithmic modeling methods.
Fourthly, we will introduce the evaluation mechanism and metrics, encompassing both conventional machine-machine comparisons and innovative human-machine comparisons. Additionally, we analyze the target tracking capability of diverse task executors. Furthermore, we present an overview of the visual Turing test, a human-machine comparison method, highlighting its application across various vision tasks, such as image comprehension, game navigation, image classification, and image recognition. In particular, we hope this tutorial helps researchers direct their attention to this emerging evaluation technique, fostering a deeper understanding of capability bottlenecks, encouraging a more thorough exploration of disparities between current methods and human capabilities, and ultimately advancing towards the goal of algorithmic intelligence.
Finally, we outline the evolution trends of visual intelligence evaluation techniques: (1) designing more human-like task definitions, (2) constructing more comprehensive and realistic evaluation environments, (3) including human subjects as task executors, and (4) using human abilities as a baseline to evaluate machine intelligence. In conclusion, this tutorial summarizes the evolution trends of visual intelligence evaluation techniques for the VOT task, further analyzes the existing challenge factors, and discusses possible future research directions.
📧xinzhao@ustb.edu.cn
Professor, School of Computer and Communication Engineering, University of Science and Technology Beijing (USTB). Xin Zhao received his PhD degree from the University of Science and Technology of China (USTC) in 2013. His research interests include video analysis, performance evaluation, and protocol design, especially for object tracking tasks. He has published papers in international journals and conferences, including IJCV, IEEE TPAMI, IEEE TIP, IEEE TCSVT, CVPR, ICCV, NeurIPS, AAAI, and IJCAI. He is an Associate Editor of Pattern Recognition and the Lead Guest Editor of the International Journal of Computer Vision. He has organized several international workshops and tutorials in conjunction with top-tier conferences on computer vision and pattern recognition.
📧shiyu.hu@ntu.edu.sg
Postdoctoral Research Fellow, Nanyang Technological University (NTU), Singapore. Shiyu Hu received her PhD degree from the University of Chinese Academy of Sciences in Jan. 2024. She has authored or co-authored more than 15 research papers in computer vision and pattern recognition at international journals and conferences, including TPAMI, IJCV, and NeurIPS. Her research interests include computer vision, visual object tracking, and visual intelligence evaluation.
Half-day (about 3 hours). Oct. 27, 2024, Capital Suite 21-A, Abu Dhabi National Exhibition Centre, Abu Dhabi, United Arab Emirates.
This tutorial welcomes researchers interested in dynamic visual tasks and visual intelligence evaluation techniques. Only basic knowledge of computer vision is required. We welcome as many audience members as the venue allows.
Third-year Ph.D. student at Institute of Automation, Chinese Academy of Sciences (CASIA)
fengxiaokun2022@ia.ac.cn
Major in Pattern Recognition and Intelligent System, Computer Vision.
Research on Visual Language Tracking, Multimodal Learning.
Second-year Ph.D. student at Institute of Automation, Chinese Academy of Sciences (CASIA)
zhangdailing2023@ia.ac.cn
Major in Pattern Recognition and Intelligent System, Computer Vision.
Research on Visual Object Tracking, Visual Turing Test.
First-year Ph.D. student at Institute of Automation, Chinese Academy of Sciences (CASIA)
lixuchen2024@ia.ac.cn
Major in Pattern Recognition and Intelligent System, Computer Vision.
Research on Visual Language Tracking, Large Language Model and Data-centric AI.