Abstract
Convolutional Neural Network (CNN) features have been successfully employed in recent works as image descriptors for various vision tasks. However, the inability of deep CNN features to exhibit invariance to geometric transformations and object compositions poses a great challenge for image search. In this work, we demonstrate the effectiveness of an objectness prior over the deep CNN features of image regions for obtaining an invariant image representation. The proposed approach represents the image as a vector of pooled CNN features describing the underlying objects. This representation provides robustness to the spatial layout of the objects in the scene and achieves invariance to general geometric transformations, such as translation, rotation and scaling. Our approach also leads to a compact representation of the scene, giving each image a smaller memory footprint.
Motivation and Method
In the figure, we show two images of the same scene and their deep CNN representations, in particular the fc7 output of the AlexNet [NIPS 2012] CNN. The 4096-D fc7 features are visualized as a 64 x 64 grid (note that they are resized to the image dimensions only for visualization). It is clear that the sets of neurons with higher (towards red) activations differ between the two representations, which makes the distance between them large in the 4096-D feature space. The objective of our work is to bring these two representations closer.
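To make this concrete, below is a minimal sketch (not the exact pipeline used in the paper) of extracting the whole-image fc7 descriptor and measuring the distance between two views of a scene. It assumes a pretrained AlexNet from a recent torchvision release as a stand-in for the Caffe model we used, and placeholder image paths scene_view1.jpg and scene_view2.jpg.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Assumption: torchvision's pretrained AlexNet stands in for the Caffe model used in the paper.
alexnet = models.alexnet(weights="IMAGENET1K_V1").eval()
# Everything up to the ReLU after the second fully connected layer, i.e. the 4096-D fc7 activation.
fc7_net = torch.nn.Sequential(alexnet.features, alexnet.avgpool, torch.nn.Flatten(1),
                              *list(alexnet.classifier.children())[:6])

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def fc7(img):
    # img: PIL image -> 4096-D fc7 descriptor
    with torch.no_grad():
        return fc7_net(preprocess(img).unsqueeze(0)).squeeze(0)

d1 = fc7(Image.open("scene_view1.jpg").convert("RGB"))   # placeholder image paths
d2 = fc7(Image.open("scene_view2.jpg").convert("RGB"))
print(torch.dist(d1, d2))   # Euclidean distance in the 4096-D fc7 space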
The motivation for our approach stems from the observation that, across different images of a scene, the set of objects present is the same. The differences among the images result from a different viewpoint (causing rotated or translated images), a different focal length (causing zoomed-in or zoomed-out images), etc. However, the underlying objects are the same in all these instances.
In our approach, we find the object regions present in the image (using the selective search approach [IJCV 2013]) and represent each of them with the most semantic features from the hierarchy of features provided by a CNN. In particular, we use the 4096-dimensional fc7 output of AlexNet to describe the region proposals. The nodes in this layer can be thought of as a set of latent attributes that fire for specific types of visual input. In order to summarize the visual content of the image into a compact signature, we max pool the representations of the individual region proposals, i.e., we keep the maximum activation of each fc7 unit so as to quantify the contribution of each attribute. After obtaining the pooled representation for each image, we seek a compact representation via approaches such as PCA and ITQ [CVPR 2011] encoding.
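The pooling step can be sketched as follows, reusing the fc7 helper from the snippet above. The boxes argument stands in for the selective search output, given as (x1, y1, x2, y2) tuples; this is an illustrative simplification rather than our exact implementation.

import torch
from PIL import Image

def pooled_fc7(image_path, boxes):
    # boxes: assumed (x1, y1, x2, y2) region proposals, e.g. from selective search
    img = Image.open(image_path).convert("RGB")
    region_feats = torch.stack([fc7(img.crop(box)) for box in boxes])   # (num_regions, 4096)
    return region_feats.max(dim=0).values   # keep the max activation of every fc7 unit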
Experiments and Results
We have experimented with the Holidays [ECCV 2008], Oxford5K [CVPR 2007], Paris6K [CVPR 2008] and UKB [CVPR 2006] datasets. The following figures and tables present the results and comparisons with existing works (please note that the references in the tables are w.r.t. the CVPRW 2015 paper).
Figure 1. Retrieval performance on various databases with the binarized representations using ITQ.
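For reference, below is a minimal sketch of ITQ binarization and Hamming-distance ranking, i.e. a simplified re-implementation of ITQ [CVPR 2011] (PCA projection followed by an iteratively learned rotation), not the exact code behind the reported numbers. It assumes X is an (N, 4096) matrix of pooled descriptors with N no smaller than the code length.

import numpy as np

def itq_codes(X, n_bits=256, n_iter=50, seed=0):
    Xc = X - X.mean(axis=0)                          # zero-center the descriptors
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Xc @ Vt[:n_bits].T                           # PCA projection to n_bits dimensions
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))  # random orthogonal init
    for _ in range(n_iter):                          # alternate binary codes B and rotation R
        B = np.sign(V @ R)
        U, _, Wt = np.linalg.svd(B.T @ V)
        R = (U @ Wt).T                               # orthogonal Procrustes update of the rotation
    return (np.sign(V @ R) > 0).astype(np.uint8)     # binary codes in {0, 1}

def rank_by_hamming(query_code, db_codes):
    # nearest database items first, by Hamming distance between binary codes
    return np.argsort((db_codes != query_code).sum(axis=1))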
Downloads
You can download the proposed representations for the four datasets [here].
Related Publications
To cite our work:
@InProceedings{Mopuri_2015_CVPR_Workshops,
author = {Reddy Mopuri, Konda and Venkatesh Babu, R.},
title = {Object Level Deep Feature Pooling for Compact Image Representation},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2015}
}