THOR: Tracking Holistic Object Representations

***Best Science Paper Award at BMVC 2019***

Axel Sauer*, Elie Aljalbout*, Sami Haddadin

* Shared First Authorship

Abstract: Recent advances in visual tracking are based on siamese feature extractors and template matching. For this category of trackers, the latest research focuses on better feature embeddings and similarity measures. In this work, we focus on building holistic object representations for tracking. The framework is designed to be used on top of previous trackers with no further need for training. We therefore present a new framework for obtaining additional object templates during the tracking process. Since the number of stored templates is limited, our method only keeps the most diverse ones. We achieve this by providing a new diversity measure in the space of siamese features recently introduced in the field. The obtained representation contains information beyond the ground truth object location provided to the system. It is thus useful not only for tracking itself but also for further tasks that require a visual understanding of objects. Strong empirical results on tracking benchmarks indicate that our method can improve the performance and robustness of state-of-the-art trackers while barely reducing their speed.

System Overview

The tracker and THOR can be considered separate components that exchange information. The input image and the initial template image are passed through an encoder (the template image only at the beginning of the sequence), which transforms both into feature vectors in an inner product space. The activation maps are then computed with a dot product. For siamese trackers, the encoder is a siamese network and the dot product is a convolution. Over time, THOR accumulates long-term (LT) and short-term (ST) templates in its long-term module (LTM) and short-term module (STM). Convolving the accumulated templates with the input image yields two sets of activation maps, one for the LT templates and one for the ST templates. The modulation module calculates a weighted spatial average and multiplies it with all activation maps. Based on these activation maps, the tracker computes the bounding boxes. The box with the highest score in each set is fed into the ST-LT switch, which determines which bounding box to use for the prediction. The final prediction is then fed back to the STM and LTM, each of which decides whether to keep it as a template. The STM also passes the diversity measure δ to the LTM.
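The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the (C, H, W) feature layout, and the uniform weights are assumptions made for illustration, and the switch rule shown here (take the higher-scoring peak) is only a placeholder, since the overview does not state how the ST-LT switch decides. Peaks in the activation maps stand in for the bounding boxes a real tracker would regress.

```python
import numpy as np

def cross_correlate(search, template):
    """Valid-mode cross-correlation of a (C, H, W) search feature map
    with a (C, h, w) template feature map; returns an (H-h+1, W-w+1) map."""
    _, H, W = search.shape
    _, h, w = template.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product between the template and one search window.
            out[i, j] = np.sum(search[:, i:i + h, j:j + w] * template)
    return out

def modulate(maps, weights):
    """Multiply every activation map by the weighted spatial average
    of all maps (weights are normalised to sum to one)."""
    maps = np.stack(maps)                       # (N, H', W')
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    mask = np.tensordot(weights, maps, axes=1)  # (H', W')
    return maps * mask

def best_peak(maps):
    """Return (score, (row, col)) of the strongest response in a set of maps."""
    idx = np.unravel_index(np.argmax(maps), maps.shape)
    return float(maps[idx]), (int(idx[1]), int(idx[2]))

def st_lt_switch(st_best, lt_best):
    """Placeholder rule (an assumption): keep the higher-scoring candidate."""
    return st_best if st_best[0] >= lt_best[0] else lt_best

# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
search_feat = rng.normal(size=(4, 16, 16))
st_templates = [rng.normal(size=(4, 5, 5)) for _ in range(2)]
lt_templates = [rng.normal(size=(4, 5, 5)) for _ in range(3)]

st_maps = [cross_correlate(search_feat, t) for t in st_templates]
lt_maps = [cross_correlate(search_feat, t) for t in lt_templates]

# One modulation over all activation maps, then split back into the two sets.
all_maps = modulate(st_maps + lt_maps, np.ones(len(st_maps) + len(lt_maps)))
st_best = best_peak(all_maps[:len(st_maps)])
lt_best = best_peak(all_maps[len(st_maps):])
prediction = st_lt_switch(st_best, lt_best)
```

The explicit loops keep the correlation easy to read; a real siamese tracker would run this step as a batched convolution on the GPU.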

Video

Benchmark Results