Tracking with Randomized ConvNets

Visual Tracking with Convolutional Random Vector Functional Link Network

Le Zhang, Student Member, IEEE, P.N. Suganthan, Fellow, IEEE

Fig. 1. Basic structure of the CNN used in this work. “conv”, “pool”, “norm” and “Fc” stand for convolutional layer, pooling layer, normalization layer and fully connected layer, respectively. The ConvNet is randomly initialized without training.

Deep neural network based methods have recently achieved excellent performance in visual tracking. Because very few training samples are available in a tracking task, these approaches rely heavily on extremely large auxiliary datasets such as ImageNet to pretrain the model. To address the discrepancy between the source domain (the auxiliary data) and the target domain (the object being tracked), the models need to be finetuned during the tracking process. However, these methods are sensitive to hyper-parameters such as the learning rate, the maximum number of epochs, the mini-batch size and so on. It is therefore worth investigating whether pretraining and finetuning through conventional back-propagation are essential for visual tracking. In the present work, we shed light on this line of research by proposing the Convolutional Random Vector Functional Link Neural Network (CRVFL), which can be regarded as a marriage of the convolutional neural network (CNN) and the random vector functional link network (RVFL), to simplify the visual tracking system. The parameters in the convolutional layer are randomly initialized and kept fixed; only the parameters in the fully connected layer need to be learned. We further propose an elegant approach to update the tracker. On the widely used visual tracking benchmark, without any auxiliary data, a single CRVFL model achieves 79.0% on the precision plot at a threshold of 20 pixels. Moreover, an ensemble of CRVFL models yields the best result in the comparison, 86.3%.
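The idea of fixed random convolutional parameters plus a learned fully connected layer can be illustrated with a small NumPy sketch. This is not the authors' implementation. The filter sizes, the ReLU nonlinearity, the direct link from raw pixels, the ridge-regression solver, and all function names (`random_conv_features`, `design_matrix`, `fit_output_weights`) are assumptions chosen for illustration, and the data are random toy arrays rather than video frames.

```python
import numpy as np


def random_conv_features(patches, filters):
    """Valid 2-D correlation with fixed random filters, followed by ReLU."""
    n, height, width = patches.shape
    k, fh, fw = filters.shape
    out_h, out_w = height - fh + 1, width - fw + 1
    maps = np.zeros((n, k, out_h, out_w))
    for i in range(fh):
        for j in range(fw):
            # Shifted view of every patch, weighted by filter tap (i, j).
            window = patches[:, None, i:i + out_h, j:j + out_w]
            maps += window * filters[None, :, i, j, None, None]
    return np.maximum(maps, 0.0).reshape(n, -1)


def design_matrix(patches, filters):
    """Concatenate random conv features with raw pixels (assumed direct link)."""
    hidden = random_conv_features(patches, filters)
    direct = patches.reshape(len(patches), -1)
    return np.hstack([hidden, direct])


def fit_output_weights(design, targets, ridge=1e-2):
    """Closed-form ridge regression; these are the only learned parameters."""
    gram = design.T @ design + ridge * np.eye(design.shape[1])
    return np.linalg.solve(gram, design.T @ targets)


rng = np.random.default_rng(0)
filters = rng.standard_normal((4, 3, 3))   # drawn once, never trained
patches = rng.standard_normal((20, 8, 8))  # toy stand-ins for image patches
targets = rng.random(20)                   # toy stand-ins for confidence labels

design = design_matrix(patches, filters)
beta = fit_output_weights(design, targets)
scores = design @ beta                     # predicted confidence per patch
```

Because the convolutional filters stay fixed, fitting reduces to one linear solve, so the sketch has no learning rate, epoch count, or mini-batch size to tune.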


Average Precision Score on Visual Tracking Benchmark (CVPR13 version)

Fig. 2. Precision Plot of OPE on Visual Tracking Benchmark (CVPR13 Version)
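For readers unfamiliar with the precision plot, the sketch below shows how a precision score at a given pixel threshold is commonly computed: the fraction of frames whose predicted target center lies within the threshold distance of the ground-truth center. The function name `precision_at` and the example coordinates are hypothetical, and the inclusive comparison (`<=`) is an assumption about the benchmark convention.

```python
import math


def precision_at(pred_centers, gt_centers, threshold=20.0):
    """Fraction of frames with center location error within `threshold` pixels."""
    if len(pred_centers) != len(gt_centers) or not pred_centers:
        raise ValueError("need two non-empty sequences of equal length")
    hits = 0
    for (px, py), (gx, gy) in zip(pred_centers, gt_centers):
        # Euclidean distance between predicted and ground-truth centers.
        if math.hypot(px - gx, py - gy) <= threshold:
            hits += 1
    return hits / len(pred_centers)


# Made-up centers for four frames (not benchmark data).
predicted = [(10, 10), (50, 50), (100, 100), (0, 0)]
ground_truth = [(12, 11), (50, 80), (100, 119), (30, 40)]
score = precision_at(predicted, ground_truth)  # two of four frames are within 20 px
```

Sweeping `threshold` over a range of values and plotting the resulting scores gives a curve of the kind shown in the precision plot. The 20-pixel point is the one quoted as the summary score.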




Kindly cite our work if you like it:

author={L. Zhang and P. N. Suganthan}, 
journal={IEEE Transactions on Cybernetics}, 
title={Visual Tracking With Convolutional Random Vector Functional Link Network},