Indexing & Search with Codebook, Inverted File/Hashing and Re-ranking
The Scale Invariant Feature Transform (SIFT) [1] transforms image data into scale-invariant coordinates relative to local features. The SIFT algorithm operates in four major stages to detect and describe local features, or keypoints, in an image: 1. Detection of extrema in scale space (images grouped by octave at different resolutions, a Difference-of-Gaussians approximation per octave, and an extremum search over the 26 neighboring pixels in scale space) - scale invariance; 2. Sub-pixel localization and filtering of keypoints (interpolation and thresholding, Hessian matrix) - rejection of low-contrast and edge responses; 3. Assignment of canonical orientations to keypoints (the highest peak, plus local peaks within 80% of the highest peak, from a 36-bin orientation histogram), so that the descriptor is aligned with the dominant orientation - rotation invariance; 4. Computation of keypoint descriptors (4x4 cells with 8 orientation bins each, i.e. a 128-D vector), normalized for illumination invariance. SURF (Speeded Up Robust Features) [2] approximates SIFT by relying on integral images for box-filter convolutions that approximate the scale space and the Hessian operator. SURF is faster than SIFT and, as its authors claim, can outperform it in some cases, but it often performs worse because of the distortions introduced by approximating the scale space and the Hessian matrix.
(Note: More recently, several binary features have been proposed, such as BRIEF - Binary Robust Independent Elementary Features [24], ORB - Oriented FAST [25] and Rotated BRIEF [21], BRISK - Binary Robust Invariant Scalable Keypoints [22], and FREAK - Fast Retina Keypoint [23]. Their common contribution is that a binary string obtained by simply comparing pairs of image intensities can efficiently describe a keypoint, i.e. a patch.)
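As a minimal sketch of descriptor extraction, the following OpenCV snippet computes SIFT (float) and ORB (binary) keypoints and descriptors. It assumes opencv-python >= 4.4 (where SIFT lives in the main module) and an image file "query.jpg", which is a placeholder path.

```python
# Minimal sketch: extract SIFT and ORB keypoints/descriptors with OpenCV.
# Assumes opencv-python >= 4.4 and a local image "query.jpg" (placeholder).
import cv2

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

# SIFT: 128-D float descriptors (4x4 cells x 8 orientation bins).
sift = cv2.SIFT_create()
kp_sift, des_sift = sift.detectAndCompute(img, None)

# ORB: 256-bit binary descriptors built from oriented FAST + rotated BRIEF.
orb = cv2.ORB_create(nfeatures=1000)
kp_orb, des_orb = orb.detectAndCompute(img, None)

print(len(kp_sift), des_sift.shape)   # (N, 128) float32
print(len(kp_orb), des_orb.shape)     # (M, 32) uint8 = 256 bits
```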
The keypoint descriptors are highly distinctive, which allows a single feature to find its correct match with good probability in a large database of features. However, in a cluttered image many features from the background will have no correct match in the database, giving rise to many false matches in addition to the correct ones. The correct matches can be filtered from the full set by identifying subsets of keypoints that agree on the object and on its location, scale, and orientation in the new image. The probability that several features agree on these parameters by chance is much lower than the probability that any individual feature match is in error. In practice, a homography is estimated from the keypoint matches, with outliers handled by RANSAC or LMedS; this is called a geometric consistency constraint (related to the stereo epipolar constraint and the fundamental matrix).
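A hedged sketch of this verification step with OpenCV is shown below: descriptors are matched with a ratio test and a homography is then fit with RANSAC, so that the inlier count serves as a geometrically verified match score. The thresholds (0.75 ratio, 5-pixel reprojection error) are illustrative choices, not values prescribed by [1].

```python
# Sketch of geometric verification: ratio-test matching followed by a
# RANSAC-estimated homography. Thresholds are illustrative only.
import cv2
import numpy as np

def verify_pair(des1, kp1, des2, kp2):
    matcher = cv2.BFMatcher(cv2.NORM_L2)           # L2 distance for SIFT-like descriptors
    knn = matcher.knnMatch(des1, des2, k=2)
    good = [pair[0] for pair in knn
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
    if len(good) < 4:                              # a homography needs >= 4 correspondences
        return 0, None
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    inliers = int(mask.sum()) if mask is not None else 0
    return inliers, H                              # inlier count as a verified match score
```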
The best candidate match for each keypoint is found by identifying its nearest neighbor in the database of keypoints from training images. The nearest neighbor is defined as the keypoint with minimum Euclidean distance to the invariant descriptor vector. The best exact algorithms, such as the k-d tree, provide no speedup over exhaustive search for more than about 10-dimensional spaces. Therefore an approximate algorithm, called Best-Bin-First (BBF), is applied; it is approximate in the sense that it returns the closest neighbor with high probability [1]. The BBF algorithm uses a modified search ordering for the k-d tree so that bins in feature space are searched in the order of their closest distance from the query location. This priority search order requires a heap-based priority queue for efficient determination of the search order. An approximate answer can be returned at low cost by cutting off further search after a specific number of the nearest bins have been explored.
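This bounded priority search is what FLANN's randomized kd-tree index exposes through its "checks" parameter, so it can serve as a stand-in sketch for BBF-style matching. In the snippet below, query_descriptors and database_descriptors are assumed to be float32 SIFT descriptor arrays (e.g. from the extraction sketch above); the parameter values are illustrative.

```python
# Approximate nearest-neighbor matching with randomized kd-trees (FLANN).
# The "checks" parameter bounds how many leaf bins the priority search
# visits, the same cut-off idea as Best-Bin-First.
import cv2

FLANN_INDEX_KDTREE = 1                           # FLANN's kd-tree index type
index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=4)
search_params = dict(checks=50)                  # stop after 50 bins (illustrative)

flann = cv2.FlannBasedMatcher(index_params, search_params)
knn = flann.knnMatch(query_descriptors, database_descriptors, k=2)

# Ratio test keeps only distinctive matches.
good = [pair[0] for pair in knn
        if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance]
```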
Fig. 1, Keypoint Matching by k-d Tree with Geometric Consistency Check
Methods for indexing and efficiently retrieving text documents are mature, and effective enough to operate on millions or billions of documents at once. A text document contains some distribution of words, and so can be compactly summarized by its word counts (known as a bag of words). Since the occurrence of a given word tends to be sparse across different documents, an index that maps words to the files in which they occur can take a keyword query and immediately produce relevant content.
Fig. 2, Document matched to Codebook
An image is a sort of document, and it contains a set of local feature descriptors. At first glance, however, the analogy would stop there: text words are discrete "tokens", whereas local image descriptors are high-dimensional, real-valued feature vectors. To carry the analogy through, we must impose a quantization on the feature space of local image descriptors, so that any novel descriptor vector can be coded in terms of the (discretized) region of feature space to which it belongs. The standard pipeline to form a so-called "visual vocabulary" consists of (1) collecting a large sample of features from a representative corpus of images, and (2) quantizing the feature space according to their statistics [4]. Often simple k-means clustering is used for the quantization; the vocabulary size k is a user-supplied parameter, and the visual "words" are then the k cluster centers. Techniques from information retrieval are applied as well, such as a stop list to suppress the most frequent visual words (or remove rarely used ones) and the standard vector/term-space model for weighting (tf-idf, i.e. term frequency-inverse document frequency) [7]. Once the vocabulary is established, the corpus of sampled features can be discarded. A novel image's features can then be translated into words by determining which visual word each is nearest to in feature space (i.e., based on the Euclidean distance between the cluster centers and the input descriptor).
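A compact sketch of this pipeline is given below, using scikit-learn's MiniBatchKMeans as the quantizer. Here all_descriptors (a stack of local descriptors sampled from a training corpus) and image_descriptor_sets (a list of per-image descriptor arrays) are placeholder names, and the vocabulary size is illustrative.

```python
# Sketch: build a visual vocabulary by k-means and encode images as
# tf-idf weighted bag-of-words histograms. Names and k are illustrative.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

k = 10000                                   # vocabulary size (user-chosen)
kmeans = MiniBatchKMeans(n_clusters=k, batch_size=10000).fit(all_descriptors)

def bow_histogram(descriptors):
    """Hard-assign each descriptor to its nearest cluster center (visual word)."""
    words = kmeans.predict(descriptors)
    return np.bincount(words, minlength=k).astype(np.float64)

# Term frequency per image, document frequency per word, then tf-idf.
tf = np.vstack([bow_histogram(d) for d in image_descriptor_sets])
df = np.count_nonzero(tf, axis=0) + 1       # +1 avoids division by zero
idf = np.log(len(image_descriptor_sets) / df)
tfidf = (tf / np.maximum(tf.sum(axis=1, keepdims=True), 1)) * idf
```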
Fig. 3, Visual Object matched to bags of words
Drawing inspiration from text retrieval methods, Sivic and Zisserman [3] first proposed quantizing local image descriptors in order to rapidly index video frames with an inverted file [6], a simple but effective way to index images efficiently. An inverted file index is just like an index in a book, where keywords are mapped to the page numbers on which they occur. In the visual-word case, we have a table that points from the word number to the indices of the database images in which that word occurs. For example, in the cartoon illustration in Figure 4 [5], the database is processed and the table is populated with image indices in part (a); the words from the new image are used to index into that table in part (b), thereby directly retrieving the database images that share its distinctive words. Retrieval via the inverted file is faster than searching every image, assuming that not all images contain every word; in practice, an image's distribution of words is indeed sparse. Since the index maintains no information about the relative spatial layout of the words within each image, a spatial verification step is typically performed on the images retrieved for a given query. Because full homography estimation is still too costly at this stage, weak geometric constraints [11] are applied instead, checking the consistency of the feature layout, such as relative angles, scaling factors and centroids. This technique of re-sorting the top list of ranked results is called re-ranking [9]. (Note: another technique, which adds images from the top of the original query's ranked list as additional queries, is called query expansion [10]. In visual search it can be applied with additional constraints, such as geometric verification.)
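The following toy Python sketch shows the idea of an inverted file: each visual word maps to the set of images containing it, so only images sharing at least one word with the query are scored. Scoring here is a plain count of shared words; a real system would use tf-idf weights and follow up with spatial re-ranking, which is omitted.

```python
# Toy inverted file: visual word id -> set of database image ids.
from collections import defaultdict

inverted_index = defaultdict(set)

def add_image(image_id, word_ids):
    for w in set(word_ids):                  # sparse: only words present in the image
        inverted_index[w].add(image_id)

def query(word_ids, top_n=10):
    votes = defaultdict(int)
    for w in set(word_ids):
        for image_id in inverted_index.get(w, ()):
            votes[image_id] += 1
    # Highest number of shared words first; re-rank this top list afterwards.
    return sorted(votes.items(), key=lambda kv: -kv[1])[:top_n]
```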
Fig. 4, Inverted file indexing for fast search
The min-Hash method stores only a small constant amount of data per image. It represents the image by a sparse set of visual words, which is a weaker representation than a bag of visual words since word-frequency information is reduced to binary information (present or absent). Similarity is measured by the set overlap (the ratio of the sizes of the intersection and the union). A min-Hash is a function f that assigns a number to each set of visual words (each image representation), with the property that the probability of two sets having the same min-Hash value is equal to their set overlap. The advantage of this choice of image representation and similarity measure is that it enables very efficient retrieval; the drawback is that some relevant information is not preserved in the (binary) set-of-visual-words representation.
To estimate the word overlap of two images, multiple independent min-Hash functions f_i are used. The fraction of the min-Hash functions that assign an identical value to the two sets gives an unbiased estimate of the similarity of the two images. To efficiently retrieve images with high similarity, the values of the min-Hash functions f_i are grouped into s-tuples called sketches. Similar images have many min-Hash values in common and hence a high probability of sharing sketches, while dissimilar images have a low chance of forming an identical sketch. Identical sketches are found efficiently by hashing. The recall of min-Hash is increased by repeating the random selection of s-tuples k times. A pair of images is a potential match when at least one sketch collision is encountered; the potential matches are typically verified further. The probability of a pair of images having at least one sketch out of k in common is a function of their word overlap. A modified min-Hashing method [7] represents an image by a set of weighted visual words (e.g. idf weights). The similarity function is a set overlap weighted by the word weights, i.e. words with low weight (common to most documents) contribute less to the similarity than rare, discriminative words. A weighted histogram intersection score is also introduced, which is able to take the term frequency into account.
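A small Python sketch of this scheme is given below: m = s*k salted hash functions produce min-Hash values, the values are grouped into k sketches of s values each, and images colliding on at least one sketch become candidate matches. The hash construction (salting Python's built-in hash) and the parameter values are purely illustrative.

```python
# Illustrative min-Hash with sketches. Each image is a set of visual word ids.
import random
from collections import defaultdict

s, k = 3, 64                       # tuple size and number of sketches (illustrative)
m = s * k                          # total number of min-hash functions
random.seed(0)
salts = [random.getrandbits(32) for _ in range(m)]

def min_hashes(word_set):
    # hash((salt, w)) simulates an independent hash function per salt
    return [min(hash((salt, w)) for w in word_set) for salt in salts]

def sketches(word_set):
    h = min_hashes(word_set)
    return [tuple(h[i * s:(i + 1) * s]) for i in range(k)]

# Index: (sketch position, sketch value) -> image ids; collisions are candidates.
buckets = defaultdict(set)

def add_image(image_id, word_set):
    for i, sk in enumerate(sketches(word_set)):
        buckets[(i, sk)].add(image_id)

def candidates(word_set):
    out = set()
    for i, sk in enumerate(sketches(word_set)):
        out |= buckets.get((i, sk), set())
    return out                      # candidates are then verified, e.g. geometrically
```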
However, min-Hash is not suitable for general image retrieval (with the exception of near duplicates) because of its low recall (sensitivity). A cluster of images is defined as a set of images with related content; in such a cluster there are many image pairs depicting the same object. Since each such pair has a certain (even as low as 3%) probability of being retrieved, the probability that not a single image pair is retrieved from a cluster drops quickly with the size of the cluster. Aly et al. [8] provide tests and evaluations of min-Hashing and inverted files for fast image indexing and retrieval. More generally, neither LSH nor min-Hashing scales gracefully to very large datasets.
A practical visual indexing and search system has to trade off search quality against efficiency while keeping the memory requirement of the index structure in mind. As a matter of fact, LSH (locality sensitive hashing) typically requires 100-500 bytes per descriptor to index, which is not tractable for large-scale datasets [13]. How should the codebook be learned and a vector assigned during quantization? There are several concerns known to practitioners [12]: 1. the number of samples required to learn the quantizer is huge; 2. the complexity of the algorithm itself is prohibitive; 3. the amount of computer memory available is not sufficient to store the floating-point values representing the centroids. In summary, building a visual indexing and search system raises several important issues: 1. user interfaces: query by keywords, sketch, examples and marked objects; 2. scalability of indexing: large-scale video databases and storage of metadata; 3. efficiency of search: load balancing and parallelism; 4. quality of search: multimodal fusion, query expansion, re-ranking and relevance feedback.
Fig. 5, Retrieval in Zurich's building database
Recently, several new methods have been proposed to improve visual object indexing and search performance:
1. Product quantization [12] decomposes the feature space into a Cartesian product of low-dimensional subspaces and quantizes each subspace separately (a toy sketch is given after this list);
2. Hamming embedding [13] adds binary signatures to refine visual words (defining a single partition of the feature space and using the Hamming distance as the similarity metric);
3. Spectral hashing [14] performs graph partitioning by a spectral method (thresholded eigenvectors of the graph Laplacian), and semantic hashing [15] seeks compact binary codes with semantic similarity by learning a deep generative model;
4. The Fisher kernel [16] transforms a set of independent samples into a vector representation (few visual words are required), as a combination of generative and discriminative classifiers;
5. Geometric hashing [17] considers the location context for hashing;
6. Visual phrases [18] work by bundling/aggregating visual words;
7. Sparse coding [20] exploits the sparsity of the vector space for efficient vector quantization;
8. Non-Euclidean manifold learning [19] (Grassmann or Riemannian) is applied to build a compact codebook;
9. Deep learning can learn discriminative features as high-level visual descriptors for retrieval, such as deep autoencoder- and CNN-based codes [26, 27].
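As the sketch promised under item 1, the numpy/scipy snippet below trains a toy product quantizer: the descriptor is split into m sub-vectors, each sub-space gets its own small codebook, and a vector is then stored as m one-byte codes. The training routine (scipy's kmeans2) and the sizes m=8, ksub=256 are illustrative choices, and the code assumes the descriptor dimension is divisible by m.

```python
# Product quantization sketch: split a D-dim descriptor into m sub-vectors,
# quantize each with its own 256-word codebook, store m bytes per vector.
import numpy as np
from scipy.cluster.vq import kmeans2

def train_pq(X, m=8, ksub=256):
    D = X.shape[1]
    d = D // m                               # sub-vector dimension (assumes m divides D)
    codebooks = []
    for j in range(m):
        sub = X[:, j * d:(j + 1) * d].astype(np.float64)
        centroids, _ = kmeans2(sub, ksub, minit="points")
        codebooks.append(centroids)
    return codebooks

def encode_pq(X, codebooks):
    m, d = len(codebooks), codebooks[0].shape[1]
    codes = np.empty((X.shape[0], m), dtype=np.uint8)
    for j in range(m):
        sub = X[:, j * d:(j + 1) * d]
        # index of the nearest centroid in sub-space j
        dists = ((sub[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(-1)
        codes[:, j] = dists.argmin(axis=1)
    return codes                             # e.g. 8 bytes per 128-D SIFT descriptor
```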
1. Lowe, D. G., Distinctive Image Features from Scale-Invariant Keypoints, IJCV, 60(2), pp. 91-110, 2004.
2. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L., SURF: Speeded Up Robust Features, Computer Vision and Image Understanding (CVIU), 110(3), pp. 346-359, 2008.
3. Sivic, J., Zisserman, A., Video Google: A Text Retrieval Approach to Object Matching in Videos, ICCV, 2003.
4. Fei-Fei, L., Fergus, R., Torralba, A., Recognizing and Learning Object Categories, Short Course at ICCV, 2005.
5. Grauman, K., Leibe, B., Visual Object Recognition, Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(2), pp. 1-181, April 2011.
6. Zobel, J., Moffat, A., Inverted Files for Text Search Engines, ACM Computing Surveys, 38(2):6, 2006.
7. Chum, O., Philbin, J., Zisserman, A., Near Duplicate Image Detection: min-Hash and tf-idf Weighting, BMVC, 2008.
8. Aly, M., Munich, M., Perona, P., Indexing in Large Scale Image Collections: Scaling Properties and Benchmark, IEEE Workshop on Applications of Computer Vision (WACV), Hawaii, January 2011.
9. Hsu, W. H., Kennedy, L. S., Chang, S.-F., Reranking Methods for Visual Search, IEEE Multimedia Magazine, 14(3), 2007.
10. Chum, O., Philbin, J., Sivic, J., Isard, M., Zisserman, A., Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval, ICCV, Brazil, 2007.
11. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A., Object Retrieval with Large Vocabularies and Fast Spatial Matching, IEEE CVPR, 2007.
12. Jegou, H., Douze, M., Schmid, C., Product Quantization for Nearest Neighbor Search, IEEE T-PAMI, 33(1), 2011.
13. Jegou, H., Douze, M., Schmid, C., Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search, ECCV, October 2008.
14. Weiss, Y., Torralba, A., Fergus, R., Spectral Hashing, Advances in Neural Information Processing Systems (NIPS), 2008.
15. Salakhutdinov, R. R., Hinton, G. E., Semantic Hashing, SIGIR Workshop on Information Retrieval and Applications of Graphical Models, 2007.
16. Perronnin, F., Sanchez, J., Mensink, T., Improving the Fisher Kernel for Large-Scale Image Classification, ECCV, 2010.
17. Chum, O., Perdoch, M., Matas, J., Geometric min-Hashing: Finding a (Thick) Needle in a Haystack, IEEE CVPR, 2009.
18. Zhang, Y., Jia, Z., Chen, T., Image Retrieval with Geometry-Preserving Visual Phrases, IEEE CVPR, 2011.
19. Wang, X., Li, Z., Tao, D., Subspace Indexing on Grassmann Manifold for Large Scale Multimedia Retrieval, IEEE Trans. on Image Processing, 20(9), 2011.
20. Yang, J., Yu, K., Gong, Y., Huang, T., Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification, IEEE CVPR, 2009.
21. Rublee, E., Rabaud, V., Konolige, K., Bradski, G., ORB: An Efficient Alternative to SIFT or SURF, ICCV, 2011.
22. Leutenegger, S., Chli, M., Siegwart, R., BRISK: Binary Robust Invariant Scalable Keypoints, IEEE ICCV, 2011.
23. Alahi, A., Ortiz, R., Vandergheynst, P., FREAK: Fast Retina Keypoint, IEEE CVPR, 2012.
24. Calonder, M., Lepetit, V., Strecha, C., Fua, P., BRIEF: Binary Robust Independent Elementary Features, ECCV, 2010.
25. Rosten, E., Porter, R., Drummond, T., Faster and Better: A Machine Learning Approach to Corner Detection, IEEE T-PAMI, 2009.
26. Krizhevsky, A., Hinton, G. E., Using Very Deep Autoencoders for Content-Based Image Retrieval, European Symposium on Artificial Neural Networks (ESANN), 2011.
27. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V., Neural Codes for Image Retrieval, ECCV, 2014.
Appendix: MPEG Compact Descriptors for Visual Search (CDVS) standard