Note: Copyright (c) 2012 Dr. Yu Huang, Sunnyvale, CA 94089. All Rights Reserved, Sep. 2012.
Online Learning-based Visual Object Tracking
Visual tracking can be regarded as a classification problem that discriminates the target appearance from its surrounding background. Online learning and updating of both the target model and the background model facilitate adaptation to appearance changes of the target as well as of the background. Since the appearance distribution of a cluttered background tends to be multimodal and irregular, a divide-and-conquer method is proposed: the background is spatially decomposed into multiple segments so that the appearance distribution within each segment is approximately unimodal; then an ensemble of classifiers is constructed from the multiple target-background pairs and weighted according to their classification errors. For the characteristics of ensemble classifiers, please refer to my earlier overview report on 'bagging', 'boosting', 'stacking', 'stochastic discrimination', the 'random subspace method', etc. (recently 'random forests', derived from these, have become popular).
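As a concrete illustration of the divide-and-conquer step, the sketch below splits the surround region of the target box into a fixed number of angular segments (cf. Figure 2). The ring geometry, the function name background_segments, the margin and n_segments parameters, and the NumPy implementation are illustrative assumptions, not the exact decomposition used in the report.

import numpy as np

def background_segments(target_box, margin, n_segments=8):
    # Split the surround of a target box into angular segments around the
    # target centre, so that each segment covers one side of the background
    # and its appearance distribution is closer to unimodal.
    x, y, w, h = target_box
    cx, cy = x + w / 2.0, y + h / 2.0
    xs, ys = np.meshgrid(np.arange(x - margin, x + w + margin),
                         np.arange(y - margin, y + h + margin))
    inside = (xs >= x) & (xs < x + w) & (ys >= y) & (ys < y + h)
    angles = np.arctan2(ys - cy, xs - cx)                       # in [-pi, pi]
    labels = ((angles + np.pi) / (2 * np.pi) * n_segments).astype(int) % n_segments
    surround = ~inside
    # coordinates of the surround pixels and the segment each belongs to
    return xs[surround], ys[surround], labels[surround]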
"Bag of patches" for the object and its background is cropped respectively as the source of positive and negative samples for building a discriminative classifier. Then the classifier is used to evaluate a region of interest (around the previous location) in the following frame and build a confidence map for object relocating. To adapt to the object (background as well) appearance variation, the current classifier is updated continuously from the new object location in the new frame.
Lu and Hager used an offline SVM for visual tracking and matting in 2007. The SVM algorithm uses structural risk minimization to find the hyperplane that optimally separates two classes of data. At almost the same time, Tang et al. proposed using an online SVM (introduced by Cauwenberghs and Poggio in 2000) to realize co-training with two different features for tracking. In online SVM learning, the Karush-Kuhn-Tucker (KKT) conditions are maintained for all previous training data whenever an old sample is removed or a new sample is added.
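For reference, the conditions that the incremental/decremental algorithm of Cauwenberghs and Poggio keeps satisfied after every sample addition or removal can be written, in standard soft-margin SVM notation (dual coefficients alpha_i, slack penalty C, kernel K, bias b), as:

\[
g_i \;=\; y_i f(x_i) - 1 \;=\; y_i \Big( \sum_j \alpha_j y_j K(x_i, x_j) + b \Big) - 1,
\qquad \sum_i \alpha_i y_i = 0,
\]
\[
g_i > 0 \;\Rightarrow\; \alpha_i = 0, \qquad
g_i = 0 \;\Rightarrow\; 0 \le \alpha_i \le C, \qquad
g_i < 0 \;\Rightarrow\; \alpha_i = C.
\]

Samples with g_i > 0 lie outside the margin, samples with g_i = 0 are margin support vectors, and samples with g_i < 0 are bounded (error) support vectors; keeping these three sets consistent is what the online update and "unlearning" steps have to preserve.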
Here we also use an online Support Vector Machine (SVM) as the classifier, in which updating of both the target model and the background model is realized by adding new samples and removing old ones, i.e. incremental learning and decremental unlearning. Different from Tang et al. and Lu & Hager, we use an ensemble of online SVM classifiers based on the separation of the neighboring background. The ensemble is trained at initialization to learn the discrimination of the target with respect to each background segment; the ensemble of SVM classifiers enables the detection of outliers, while each SVM applies slack variables to handle noise disturbance.
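The following is a minimal structural sketch of such an ensemble, one classifier per target/background-segment pair. scikit-learn's SGDClassifier with hinge loss is used purely as a convenient stand-in for a true incremental/decremental kernel SVM (it supports adding samples via partial_fit but cannot remove old ones), and the error-based re-weighting rule is an illustrative choice; the class and method names are not from the report.

import numpy as np
from sklearn.linear_model import SGDClassifier    # linear SVM via hinge loss

class TargetBackgroundEnsemble:
    # One online linear SVM per target/background-segment pair; call
    # init_train() once before update() or score().
    def __init__(self, n_segments):
        self.models = [SGDClassifier(loss="hinge", alpha=1e-4)
                       for _ in range(n_segments)]
        self.weights = np.ones(n_segments) / n_segments

    def init_train(self, pos_feats, neg_feats_per_segment):
        for m, neg in zip(self.models, neg_feats_per_segment):
            X = np.vstack([pos_feats, neg])
            y = np.hstack([np.ones(len(pos_feats)), -np.ones(len(neg))])
            m.partial_fit(X, y, classes=np.array([-1.0, 1.0]))

    def update(self, pos_feats, neg_feats_per_segment, errors):
        # incremental step: add new samples; decremental 'unlearning' of old
        # samples is not expressible with SGD and is omitted in this sketch
        for m, neg in zip(self.models, neg_feats_per_segment):
            X = np.vstack([pos_feats, neg])
            y = np.hstack([np.ones(len(pos_feats)), -np.ones(len(neg))])
            m.partial_fit(X, y)
        # re-weight ensemble members by their recent classification errors
        err = np.clip(np.asarray(errors, float), 1e-6, 1 - 1e-6)
        w = np.log((1 - err) / err).clip(min=0)
        self.weights = w / (w.sum() + 1e-12)

    def score(self, feats):
        # weighted vote of the members' decision values
        return sum(w * m.decision_function(feats)
                   for w, m in zip(self.weights, self.models))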
In object tracking, the classification scores of unlabeled samples in the (motion-)predicted target-surround window are mapped to image coordinates and vote into the final confidence map, whose local mode, detected by a mean shift method, corresponds to the estimated target position.
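A possible mode seeker for such a confidence map (e.g. the (conf, offsets) pair produced by the earlier sketch) is given below; the Gaussian kernel, the bandwidth value and the stopping tolerances are illustrative assumptions.

import numpy as np

def mean_shift_mode(conf, offsets, start=(0.0, 0.0), bandwidth=6.0, iters=20):
    # Find the local mode of the confidence map by mean shift, starting from
    # the (motion-)predicted offset relative to the previous target position.
    ys, xs = np.meshgrid(offsets, offsets, indexing="ij")
    pts = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    votes = np.clip(conf.ravel(), 0, None)            # votes must be non-negative
    y, x = start
    for _ in range(iters):
        d2 = (pts[:, 0] - y) ** 2 + (pts[:, 1] - x) ** 2
        k = np.exp(-d2 / (2 * bandwidth ** 2)) * votes  # Gaussian kernel * confidence
        if k.sum() < 1e-12:
            break
        ny, nx = (k @ pts) / k.sum()
        if abs(ny - y) < 1e-3 and abs(nx - x) < 1e-3:
            break
        y, x = ny, nx
    return y, x                                        # offset of the estimated position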
Figure 1: Flow Chart of Online SVM Learning for Tracking.
Figure 2: Segmented Background and Ensemble Training.
The popular (offline) AdaBoost classifier is widely used for improving the accuracy of a given learning algorithm. Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb. Repeated training with different subsets of the training data generates N hypotheses, and a weighted vote of those hypotheses can turn a weak learning algorithm into a strong one.
Tieu and Viola introduced boosting methods for feature selection in 2001. The idea is that each feature corresponds to a single weak classifier, and boosting selects among the features. At each boosting iteration, all features are evaluated and the best one is selected and added to the ensemble; it forms the weak hypothesis, and its voting weight is computed from its error. Finally, a strong classifier is formed as a weighted linear combination of the weak classifiers.
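The sketch below shows this scheme in its simplest offline form: discrete AdaBoost with decision stumps, so that each round selects one feature, assigns it a weight derived from its weighted error, and the strong classifier is the weighted sum of the selected stumps. The stump parameterization and the exhaustive threshold search are illustrative simplifications, not the feature set of Tieu and Viola.

import numpy as np

def train_feature_selection_boost(X, y, n_rounds):
    # Discrete AdaBoost where each weak classifier is a decision stump on a
    # single feature, so each boosting round effectively selects one feature.
    # X: (n_samples, n_features), y in {-1, +1}.
    n, d = X.shape
    w = np.full(n, 1.0 / n)                  # sample weight distribution
    ensemble = []                            # (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        for j in range(d):                   # evaluate every candidate feature
            for thr in np.unique(X[:, j]):
                for pol in (+1, -1):
                    pred = pol * np.sign(X[:, j] - thr + 1e-12)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)         # voting weight from the error
        w *= np.exp(-alpha * y * pred)                # re-weight the samples
        w /= w.sum()
        ensemble.append((j, thr, pol, alpha))
    return ensemble

def strong_classify(ensemble, x):
    # Strong classifier: weighted linear combination of the selected stumps.
    s = sum(a * p * np.sign(x[j] - t + 1e-12) for j, t, p, a in ensemble)
    return np.sign(s)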
Viola, Platt and Zhang argued in 2005 that object localization by detection has inherent ambiguities that make it more difficult to train a classifier using traditional supervised methods. For this reason they suggested the use of a MIL (Multiple Instance Learning) approach. The basic idea is that during training, examples are presented in sets or bags and labels are provided for the bags rather than for individual instances. If a bag is labeled positive, it is assumed to contain at least one positive instance; otherwise the bag is negative. The ambiguity is thus passed on to the learning algorithm, which now has to figure out which instance in each positive bag is the most "correct".
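In the MILBoost formulation of Viola, Platt and Zhang (reference [6]), this is made concrete by combining instance probabilities into a bag probability with a Noisy-OR model and boosting against the bag log-likelihood rather than a per-instance error:

\[
p_{ij} \;=\; \sigma\big(H(x_{ij})\big) \;=\; \frac{1}{1 + e^{-H(x_{ij})}},
\qquad
p_i \;=\; 1 - \prod_{j} \big(1 - p_{ij}\big),
\]
\[
\log L \;=\; \sum_i \Big( t_i \log p_i + (1 - t_i)\log(1 - p_i) \Big),
\]

where H is the strong classifier, x_{ij} is the j-th instance of bag i, and t_i in {0,1} is the bag label; a bag is declared positive as soon as one of its instances is.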
Since we do not know a priori how difficult/good a sample is, the online boosting algorithm turns to a new strategy for computing the weight distribution. In 2001, Oza proposed an online boosting framework in his thesis. The idea is that the importance of a sample can be estimated by propagating it through the set of weak classifiers. Oza's approach is not directly applicable to feature selection: his algorithm has no way of choosing the most discriminative feature because the entire training set is not available at one time. Grabner and Bischof proposed a modified method in 2006 that performs feature selection by means of so-called selectors, which choose from a pool of M > N candidate weak classifiers. The number of weak classifiers N is fixed at the beginning. Each sample is used to update all weak classifiers and the corresponding voting weights, and for each selector the most discriminative feature seen so far is chosen from the given feature pool. It is suggested that the worst feature be replaced by a new one randomly chosen from the feature pool. More recently, Babenko, Yang and Belongie in 2009 extended Viola, Platt and Zhang's MIL work to online boosting for visual tracking. Similar to online AdaBoost, the weak classifiers are selected from a pool so as to minimize a loss function rather than the training error.
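A compact sketch of the selector-based online boosting update follows. The weak-learner interface (update/predict), the error bookkeeping and the importance-update rule follow the Oza / Grabner-Bischof scheme only in outline; the replacement of the worst feature is omitted, and the class and function names are illustrative.

import numpy as np

class Selector:
    # A selector holds candidate weak learners (features) and exposes the one
    # with the lowest estimated error; weak learners are assumed to provide
    # update(x, y, lam) and predict(x).
    def __init__(self, weak_learners):
        self.weak = weak_learners
        self.lam_correct = np.ones(len(weak_learners))
        self.lam_wrong = np.ones(len(weak_learners))
        self.best = 0

    def update(self, x, y, lam):
        # update every candidate weak learner with the importance-weighted sample
        for k, h in enumerate(self.weak):
            h.update(x, y, lam)
            if h.predict(x) == y:
                self.lam_correct[k] += lam
            else:
                self.lam_wrong[k] += lam
        errors = self.lam_wrong / (self.lam_correct + self.lam_wrong)
        self.best = int(np.argmin(errors))             # keep the best feature
        return errors[self.best]

    def predict(self, x):
        return self.weak[self.best].predict(x)

def online_boost_update(selectors, x, y):
    # One Oza-style pass: the sample importance lam grows when a selector
    # misclassifies the sample and shrinks when it classifies it correctly;
    # selector voting weights are derived from the estimated errors.
    lam, alphas = 1.0, []
    for sel in selectors:
        err = float(np.clip(sel.update(x, y, lam), 1e-6, 1 - 1e-6))
        alphas.append(0.5 * np.log((1 - err) / err))
        if sel.predict(x) == y:
            lam *= 1.0 / (2.0 * (1.0 - err))
        else:
            lam *= 1.0 / (2.0 * err)
    return alphas                                       # voting weights of the selectors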
Here we propose an online multiple-negative-modality learning boosting method for visual tracking. Usually multiple poses of the tracked object are handled, as in the work by Kim et al. in 2009. Based on the separation of the neighboring background, we instead take care of the multi-modality of the neighboring background distribution in visual tracking. The negative data are split into groups and used for classifier training simultaneously. Similar to online AdaBoost, gradient boosting is adopted and a fixed number of strong classifiers are trained. The latent variables associated with each negative sample are unknown, so coordinate descent is run in each phase to add a weak classifier to a given strong classifier.
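The structural point, namely that each strong classifier sees all positives but only one group of negatives, can be sketched as follows; the classifiers are assumed to expose an online update(x, y) method (for instance the selector-based routine above), and the gradient-boosting loss and the coordinate-descent step over the latent variables described in the text are not reproduced in this sketch.

def update_multi_negative_boosting(strong_classifiers, pos_samples, neg_groups):
    # One online round: each strong classifier is paired with one group of
    # negatives (e.g. one background segment) but is updated with all
    # positive samples, so the fixed set of strong classifiers is trained
    # simultaneously on the split negative data.
    for clf, negs in zip(strong_classifiers, neg_groups):
        for x in pos_samples:
            clf.update(x, +1)
        for x in negs:
            clf.update(x, -1)
    return strong_classifiers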
Figure 3: Flow Chart of Online MIL Boosting for Tracking.
Figure 4: Example of Online Object Tracking (confidence map and samples for FG/BG).
Deep learning and sparse coding can help with feature extraction for object representation. The question is: can they deliver features that are more robust and discriminative against cluttered background, partial occlusion and appearance variations? Some recent research results are reported in references [12-14].
1. B. Babenko, M.-H. Yang, S. Belongie, Visual Tracking with Online Multiple Instance Learning, IEEE CVPR'09, June 2009.
2. L. Lu, G. Hager, A Nonparametric Treatment for Location/Segmentation Based Visual Tracking, IEEE CVPR'07, 2007.
3. F. Tang, S. Brennan, Q. Zhao, H. Tao, Co-tracking Using Semi-supervised Support Vector Machines, IEEE ICCV'07, 2007.
4. Z. Yin, R. Collins, Spatial Divide and Conquer with Motion Cues for Tracking through Clutter, BMVC'06, pages 47-56, 2006.
5. H. Grabner, M. Grabner, and H. Bischof, Real-time Tracking via On-line Boosting, IEEE CVPR'06, 2006.
6. P. Viola, J. Platt, and C. Zhang, Multiple Instance Boosting for Object Detection, NIPS 2005, Vancouver, Canada, 2005.
7. G. Cauwenberghs, T. Poggio, Incremental and Decremental Support Vector Machine Learning, NIPS'00, 2000.
8. N. C. Oza, Online Ensemble Learning, Ph.D. Thesis, The University of California, Berkeley, CA, 2001.
9. T.-K. Kim, T. Woodley, B. Stenger, R. Cipolla, Online Multiple Classifier Boosting for Object Tracking, CUED/F-INFENG/TR631, Department of Engineering, University of Cambridge, June 2009.
10. Y. Huang, Overview of Ensemble Classifiers, Technical Report, IFP, UIUC, Dec. 2002.
11. A. Criminisi, J. Shotton, and E. Konukoglu, Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning, MSR-TR-2011-114, 28 October 2011.
12. N. Wang, J. Wang, D.-Y. Yeung, Online Robust Non-negative Dictionary Learning for Visual Tracking, ICCV'13, 1-8 December 2013.
13. J. Jin, A. Dundar, J. Bates, C. Farabet, E. Culurciello, Tracking with Deep Neural Networks, Conference on Information Sciences and Systems (CISS), 2013.
14. N. Wang, D.-Y. Yeung, Learning a Deep Compact Image Representation for Visual Tracking, NIPS, 2013.