BoW-based Image/Scene Classification with Naive Bayes Classifiers/SVMs
Scene classification is useful in automatic white balance and scene recognition, as well as in content-based image indexing, query, and retrieval. It can also assist depth generation in the 2D-to-3D conversion of images, which plays a critical role in 3D TV products. Much as in visual object search or recognition, there is the notorious "semantic gap" problem in image/scene classification (called "category" or "generic object" recognition by some peers). It is even more challenging than the former and remains an unsolved problem.
In the experiment, the MIT scene categories dataset is used, which contains 8 outdoor scene categories: coast, mountain, forest, open country, street, inside city, tall buildings and highways. There are 2,600 color images of 256x256 pixels. All the objects and regions in this dataset have been fully labeled; there are more than 29,000 labeled objects. When dividing the labeled data into separate training and testing subsets, a common rule of thumb is to use 70% of the database for training and 30% for testing, as sketched below. (The "unbalanced data" issue is discussed below.)
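A minimal sketch of the 70/30 split, assuming the per-image features and labels have already been assembled into arrays X and y (illustrative names); a stratified split keeps the per-category proportions intact:

```python
# 70/30 train/test split; stratify=y preserves class proportions,
# which matters for the unbalanced-data issue discussed later.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
```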
Fig. 1, MIT database (8 outdoor scene categories: coast, mountain, forest, open country, street, inside city, tall buildings and highways) for scene classification
Similar to specific object (face, car, animal or building) recognition, there are generally two families of representation and recognition algorithms in this field: part-based and bag-of-features (appearance-based and feature-based). Here the Bag of Visual Words concept is applied. Analogous to terms in a text document, an image has local interest points, or keypoints, defined as salient image patches (small regions) that contain rich local information of the image. Denoted by small crosses in the three images in Fig. 2 [2], keypoints usually lie around the corners and edges of image objects. Images can be represented by sets of keypoint descriptors, but the sets vary in cardinality and lack meaningful ordering; this creates difficulties for learning methods (e.g., classifiers) that require feature vectors of fixed dimension as input.

The vector quantization (VQ) technique resolves this: it clusters the keypoint descriptors in their feature space into a large number of clusters using the K-means clustering algorithm and encodes each keypoint by the index of the cluster to which it belongs. Each cluster is saved as a visual word that represents a specific local pattern shared by the keypoints in that cluster. Thus, the clustering process generates a visual-word vocabulary describing the different local patterns in images (a minimal construction sketch follows Fig. 2). The number of clusters determines the size of the vocabulary, which can vary from hundreds to tens of thousands; this is also a critical parameter for overall classification performance, and there is usually an optimal number dependent on the training data.

Mapping the keypoints to visual words, we can represent each image as a "bag of visual words". This representation is analogous to the bag-of-words document representation in both form and semantics: both representations are sparse and high-dimensional, and just as words convey the meaning of a document, visual words reveal local patterns characteristic of the whole image. The bag-of-visual-words representation can be converted into a visual-word vector similar to the term vector of a document. The visual-word vector may contain the presence or absence of each visual word in the image, the count of each visual word (i.e., the number of keypoints in the corresponding cluster), or the count weighted by other factors.
Fig. 2, Bag of Visual Words from extracted patch-based visual features
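The vocabulary construction and histogram encoding described above can be sketched as follows; `descriptors` (one array of keypoint descriptors per image, e.g., from SIFT) and the vocabulary size of 1,000 are assumptions for illustration:

```python
# Build a visual-word vocabulary by K-means, then encode each image
# as a histogram of visual-word occurrences.
import numpy as np
from sklearn.cluster import KMeans

vocab_size = 1000                                  # size of the vocabulary (K)
all_desc = np.vstack(descriptors)                  # pool keypoints of all images
kmeans = KMeans(n_clusters=vocab_size, n_init=10).fit(all_desc)

def bow_histogram(desc):
    """Map each keypoint to its nearest cluster (visual word) and count."""
    words = kmeans.predict(desc)                   # cluster index per keypoint
    return np.bincount(words, minlength=vocab_size)

X = np.stack([bow_histogram(d) for d in descriptors])  # one histogram per image
```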
The SVM (support vector machine) classifier finds a hyperplane that separates two-class data with maximal margin [3]. It is a statistical learning method realizing structural risk minimization rather than empirical risk minimization alone, taking the VC dimension (structural complexity) into account. For given observations X and corresponding labels Y taking values +/-1, one finds a classification function f(x) = sign(w^T x + b), where w and b are the parameters of the hyperplane. The margin is defined as the distance from the closest training point to the separating hyperplane; its maximization can be formulated as a constrained optimization problem solved with Lagrange multipliers, and then transformed into a dual formulation in terms of those multipliers.
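Written out in standard textbook form (consistent with the description above, not specific to [3]), the maximal-margin problem and its Lagrange-multiplier dual are:

\[
\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\quad i = 1,\dots,n,
\]
\[
\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0.
\]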
Data sets are not always linearly separable. The SVM takes two approaches to cope with this problem. First, it introduces an error-weighting constant C that penalizes misclassification of samples in proportion to their distance from the classification boundary (via slack variables); this is the soft margin. Second, a mapping F is made from the original data space of X to another feature space, which may have high or even infinite dimension. One of the advantages of the SVM is that it can be formulated entirely in terms of scalar products in this second feature space, by introducing the kernel K(u, v) = F(u)·F(v), known as the "kernel trick". Both the kernel K and the penalty C are problem dependent and need to be determined by the user. The support vectors are the feature vectors lying nearest to the separating hyperplane, i.e., those whose Lagrange multipliers are greater than zero.
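In the same notation, the soft margin adds slack variables weighted by C:

\[
\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,
\]

and the kernel trick leaves the dual unchanged except that every inner product x_i^T x_j becomes K(x_i, x_j) and the multipliers are boxed: 0 <= alpha_i <= C.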
To apply the SVM [6] to multi-class problems, take the one-against-all approach, where each subproblem discriminates a given class from all the other classes. (Note: other options are the all-against-all method, which compares each class with every other class; the error-correcting output-coding method, which gives each class a codeword, as well as its generalized coding variants; hierarchical classification; etc.) Given an m-class problem, train m SVMs, each distinguishing the images of some category i from the images of all the other m-1 categories j != i. Given an unknown image, assign it to the class with the largest SVM output, as sketched below.
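A minimal one-against-all sketch (hypothetical names; X is the matrix of visual-word histograms and y holds integer labels 0..m-1):

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(X, y, m, C=1.0, gamma="scale"):
    """Train m binary SVMs, each separating class i from the rest."""
    models = []
    for i in range(m):
        svm = SVC(kernel="rbf", C=C, gamma=gamma)
        svm.fit(X, (y == i).astype(int))   # class i vs. all others
        models.append(svm)
    return models

def predict_one_vs_all(models, X):
    """Assign each sample to the class whose SVM output is largest."""
    scores = np.stack([m.decision_function(X) for m in models], axis=1)
    return scores.argmax(axis=1)
```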
In this case, the input features are the binned histograms formed by the number of occurrences of each visual word from the vocabulary in the image. Scaling is important before applying the SVM; it is recommended to linearly scale each attribute to the range [-1, 1] or [0, 1]. Kernel selection is a critical issue: usually the RBF kernel is the first choice, and it has fewer numerical difficulties than, say, the polynomial kernel. When the number of features is very large, one may choose the linear kernel instead. The SVM has several parameters to choose, and cross-validation provides the procedure for finding the best ones. In v-fold cross-validation, the training data are divided into v subsets of equal size; sequentially, each subset is tested using the classifier trained on the remaining v-1 subsets (when v equals the number of training samples, this becomes leave-one-out cross-validation). The cross-validation accuracy is the percentage of data that are correctly classified. The main goal of cross-validation is to prevent overfitting (i.e., learning irrelevant details of the data and eventually its noise); overfitting implies poor generalization when classifying new data. Additionally, sensitivity to noise and computational complexity may increase with the dimension of the feature space, a problem known as the curse of dimensionality. Grid search is a straightforward, if somewhat naive, method for parameter selection with cross-validation; the settings for SVM classifiers are the penalty parameter C and, for a nonlinear kernel, the kernel width sigma. Note that when C gets too large, overfitting may result because few training samples are permitted to fall inside the margin; on the contrary, underfitting may occur if C is set too small. Be reminded that the dictionary size (the dimension of the codebook) must also be set carefully.
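A minimal cross-validated grid-search sketch; the exponential grids over C and gamma are illustrative defaults, not tuned values, and X_train/y_train come from the split above:

```python
# Scale attributes to [0, 1], then search C and gamma by 5-fold CV.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = make_pipeline(MinMaxScaler(feature_range=(0, 1)),
                     SVC(kernel="rbf"))
param_grid = {"svc__C":     [2**k for k in range(-5, 16, 2)],
              "svc__gamma": [2**k for k in range(-15, 4, 2)]}
search = GridSearchCV(pipe, param_grid, cv=5)   # 5-fold cross-validation
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)  # CV accuracy of best setting
```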
Many datasets encountered in computer vision applications are unbalanced, i.e., one class contains many more examples than the other. Unbalanced datasets can present a challenge when training a classifier, and SVMs are no exception [16]. The trivial approach is the majority-class classifier, which is mostly useless. Work on this issue usually falls into the following categories: 1. resampling the data; 2. modifying existing learning algorithms (cost-sensitive learning, one-class classifiers, and ensemble classifiers); 3. measuring classifier performance in imbalanced domains; 4. studying the relationship between class imbalance and other data-complexity characteristics. To correct for the imbalance in the data, we need to assign a different misclassification cost to each class in the SVM. Assuming that the number of misclassified examples from each class is proportional to the number of examples in that class, we choose C+ and C- such that C+ n+ = C- n-, where n+ (n-) is the number of positive (negative) examples. This provides a method for setting the ratio between the soft-margin constants of the two classes, leaving one parameter that needs to be adjusted, as sketched below.
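A minimal sketch of the cost-sensitive soft margins (names as before, with y in {0, 1}); scikit-learn expresses the two constants as per-class weights multiplying a single C:

```python
# class_weight="balanced" sets the weight of class k to
# n_samples / (n_classes * n_k), which realizes the C+ n+ = C- n-
# rule while leaving one free parameter C to tune.
from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=1.0, class_weight="balanced")
svm.fit(X_train, y_train)
```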
(Note: Two other popular methods are worth mentioning in comparison with the SVM: the decision tree [15] and the neural network [10]; a minimal sketch follows Fig. 5. A decision tree uses a tree-like graph or model of decisions; decision tree learning is the construction of a decision tree from class-labeled training tuples, and the tree model can be either a regression tree or a classification tree. There is a popular ensemble of decision trees called the random forest [9]. Decision trees are efficient at processing large amounts of training data. To avoid overfitting in a decision tree, either pre- or post-pruning is used. A decision tree works like a flowchart, whereas a neural network is more of a "black box": it is very hard to know how it makes its classification decisions. When looking at a decision tree, it is easy to see that some initial variable divides the data into two categories and then other variables split the resulting child groups. This information is very useful to the researcher who is trying to understand the underlying nature of the data being analyzed. If a challenge is made to a decision based on a neural network, it is very difficult to explain and justify to non-technical people how the decision was made. Binary categorical input data for neural networks can be handled by using 0/1 (off/on) inputs, but categorical variables with multiple classes (for example, marital status or the state in which a person resides) are awkward to handle. Besides, if the goal is to produce a program that can be distributed with a built-in predictive model, it is usually necessary to ship an additional module or library just for the neural network interpretation. In contrast, once a decision tree model has been built, it can be converted to if…then…else statements that can be implemented easily in most computer languages without requiring a separate interpreter. Neural networks, for their part, handle overfitting by weight decay, weight elimination, and optimal brain damage (OBD), among others. In the well-known Multi-Layer Perceptron (i.e., number of layers >= 3), the number of hidden units is critical: too few units prevent the network from adequately fitting the data, while too many units result in overfitting. Recently, deep learning with neural networks, e.g. ConvNets, viewed as hierarchical feature learning models [14], has been successful at progressively learning multiple levels of visual patterns.)
Fig. 3, SVM
Fig. 4, NN-MLP
Fig. 5, Classification tree
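As referenced in the note above, a minimal sketch contrasting an unpruned decision tree, a pre-pruned one, and a random forest; all parameter values are illustrative, and X_train/X_test/y_train/y_test come from the split sketched earlier:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

unpruned = DecisionTreeClassifier().fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=8,        # pre-pruning: limit depth
                                min_samples_leaf=5  # and minimum leaf size
                                ).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

for name, model in [("unpruned", unpruned), ("pruned", pruned),
                    ("random forest", forest)]:
    print(name, model.score(X_test, y_test))        # test-set accuracy
```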
Naive Bayes [4] is a simple classifier often used in text categorization. It can be viewed as the maximum a posteriori probability classifier for a generative model in which: 1) a document category is selected according to class prior probabilities; 2) each word in the document is chosen independently from a multinomial distribution over words specific to that class. While independence is a naive assumption, the accuracy of Naive Bayes classification is typically high [4]. In applying Naive Bayes classifiers, the "zero frequency" problem is solved by a smoothing technique, such as Laplacian estimation. In training a Naive Bayes classifier, the task is to estimate the class prior probabilities and the probabilities of the data given the class, where the settings include the degree of smoothing, the number of bins to use when discretizing continuous features, and possibly more.
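A minimal multinomial Naive Bayes sketch on the visual-word count vectors, with Laplace ("add-one") smoothing handling the zero-frequency problem (names as in the earlier sketches):

```python
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=1.0)   # alpha sets the degree of smoothing
nb.fit(X_train, y_train)        # estimates class priors and P(word|class)
pred = nb.predict(X_test)       # maximum a posteriori class per image
```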
(Note: Logistic regression [6] is usually mentioned alongside Naive Bayes: the former directly estimates the posterior probability of the class given the data, i.e., a discriminative classifier, while the latter models the probability of the data given the class together with the class prior, i.e., a generative classifier. When plenty of training data is available, logistic regression is the better choice; with little training data, Naive Bayes tends to outperform it. On one hand, if the conditional independence assumption actually holds, a Naive Bayes classifier converges more quickly than logistic regression and so needs much less training data; even if the assumption does not hold, a Naive Bayes classifier still often performs surprisingly well in practice. On the other hand, logistic regression is suggested if one wants a probabilistic framework (e.g., to easily adjust classification thresholds) or expects to receive more training data in the future that should be quickly incorporated into the classifier (see the sketch after Fig. 7). BTW, logistic regression cannot easily handle categorical variables, nor is it good at detecting interactions between variables. Classification trees, in contrast, are well suited to modeling target variables with binary values, but, unlike logistic regression, they can also model variables with more than two discrete values, and they handle variable interactions. Like the SVM, logistic regression can apply regularization and the kernel trick too. However, the SVM requires fewer variables than logistic regression to achieve an equivalent misclassification rate. The SVM's loss function is different, related to the maximal-margin theory.)
Fig. 6, Naive Bayes Classifier
Fig. 7, Logistic function for P(Y|X)
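The sketch referenced in the note: a minimal L2-regularized logistic regression, whose probabilistic outputs make classification thresholds easy to adjust (C, the inverse regularization strength, is an illustrative value):

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
lr.fit(X_train, y_train)
proba = lr.predict_proba(X_test)   # class probabilities P(Y|X) per image
```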
The main steps of scene classification are [1-2]: 1) detection and description of image patches; 2) assigning patch descriptors to a set of predetermined clusters (a vocabulary) with a vector quantization algorithm; 3) constructing a bag of keypoints, which counts the number of patches assigned to each cluster; 4) applying a multi-class classifier, treating the bag of keypoints as the feature vector, to determine which category or categories to assign to the image. For supervised learning with two possible classes, all measures of performance are based on true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), as reflected in the confusion (or contingency) matrix. An example confusion matrix for the GIST-descriptor-based method [6] is given below. Depending on the application, different performance metrics are computed from these entries, such as accuracy, precision, and recall (hit rate). Simply put, precision = TP/(TP+FP), recall = TP/(TP+FN), and accuracy = (TP+TN)/(TP+TN+FP+FN); a computation sketch follows Fig. 8. The ROC (receiver operating characteristic) curve is also used for classifier performance evaluation, i.e., plotting the fraction of true positives (the true positive rate, TPR = TP/(TP+FN)) vs. the fraction of false positives (the false positive rate, FPR = FP/(FP+TN)) at various threshold settings. ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution.
Fig. 8, Flowchart
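The metrics above can be computed directly from predictions; a minimal sketch (y_test and pred as produced by any of the classifiers sketched earlier):

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, roc_curve)

print(confusion_matrix(y_test, pred))          # rows: true, cols: predicted
print(accuracy_score(y_test, pred))            # (TP+TN)/(TP+TN+FP+FN)
print(precision_score(y_test, pred, average="macro"))  # TP/(TP+FP), class-averaged
print(recall_score(y_test, pred, average="macro"))     # TP/(TP+FN), class-averaged

# The ROC curve needs continuous scores, not hard labels; e.g., for one
# binary SVM from the one-against-all sketch:
# fpr, tpr, thresholds = roc_curve(y_test_binary, svm.decision_function(X_test))
```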
Note: Recently, incorporating a context model, such as spatial or object information, has been applied to improve category-level image classification performance. For example, spatial information can be coded into the feature extraction: rectangular grids [2], spatial pyramid matching [5], the discriminative spatial pyramid [18], GIST (the spatial layout properties, a holistic representation of the spatial envelope) [6], object-based representations [7-8], spatial BoW [19], semantic context [20], latent semantic analysis (LSA, by SVD) [12] and probabilistic LSA (pLSA, by mixture distribution) for topic discovery, latent Dirichlet allocation (LDA, a generative model) [13], and discriminative LDA [17], etc. Contextual information about the interactions/relationships among local image features, image regions, and objects/scenes helps disambiguate visual words, which often leads to better classification results. The context can be global, i.e., modeling the spatial layout of image patches (features) or objects, or local, i.e., modeling the relationship of neighboring patches (features) or objects. Some researchers therefore apply a hybrid context model [17], which builds a context-aware model to capture both global and local contextual information. Recently, deep learning, CNNs for example, has been used for scene classification, extended from generic object recognition (ImageNet classification) [21].
1. G. Csurka, C. R. Dance, L. Fan, J. Willamowski, C. Bray, Visual Categorization with Bags of Keypoints, ECCV'04.
2. J. Yang et al., Evaluating Bag-of-Visual-Words Representations in Scene Classification, MIR'07.
3. C. Cortes and V. Vapnik, Support-Vector Networks, Machine Learning, 20, 1995.
4. I. Rish, An Empirical Study of the Naive Bayes Classifier, IJCAI Workshop on Empirical Methods in Artificial Intelligence, 2001.
5. S. Lazebnik, C. Schmid, and J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, CVPR'06.
6. A. Oliva and A. Torralba, Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope, International Journal of Computer Vision (IJCV), 42, 2001.
7. L.-J. Li, H. Su, E. P. Xing and L. Fei-Fei, Object Bank: A High-Level Image Representation for Scene Classification and Semantic Feature Sparsification, NIPS, 2010.
8. A. Quattoni and A. Torralba, Recognizing Indoor Scenes, CVPR, 2009.
9. A. Bosch, A. Zisserman, and X. Munoz, Image Classification Using Random Forests and Ferns, ICCV, 2007.
10. R. Socher, C. C. Lin, A. Y. Ng, C. D. Manning, Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML, 2011.
11. S. Aksoy, K. Koperski, C. Tusk, G. Marchisio, Learning Bayesian Classifiers for Scene Classification with a Visual Grammar, IEEE T-Geoscience and Remote Sensing, 43(3), 2005.
12. A. Bosch, A. Zisserman, X. Munoz, Scene Classification Using a Hybrid Generative/Discriminative Approach, IEEE T-PAMI, 30(4), 2008.
13. P. K. Elango, Clustering Images Using the Latent Dirichlet Allocation Model, University of Wisconsin, 2005.
14. A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet Classification with Deep CNN, NIPS, 2012.
15. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Monterey, CA: Wadsworth & Brooks, 1984.
16. F. Provost, Learning with Imbalanced Data Sets 101, AAAI 2000 Workshop on Imbalanced Data Sets, 2000.
17. Z. Niu, G. Hua, X. Gao, Q. Tian, Context Aware Topic Model for Scene Recognition, CVPR, 2012.
18. T. Harada, Y. Ushiku, Y. Yamashita, and Y. Kuniyoshi, Discriminative Spatial Pyramid, CVPR, 2011.
19. Y. Cao, C. Wang, Z. Li, L. Zhang, and L. Zhang, Spatial Bag-of-Features, CVPR, 2010.
20. Y. Su and F. Jurie, Visual Word Disambiguation by Semantic Contexts, ICCV, 2011.
21. B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva, Learning Deep Features for Scene Recognition Using Places Database, NIPS'14, 2014.