Given a scene image and a target object image, we initially proposed extracting multi-channel feature maps from both images using a ResNet-18 network pre-trained on ImageNet and then fine-tuned. Attention is then applied in this feature space to generate a multi-channel heat map of the scene. We compute attention as a simple element-wise (Hadamard) product between the two feature maps.
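The attention step described above can be sketched in a few lines. This is a minimal illustration, not our implementation: it assumes both feature maps have already been extracted and share the same shape, and it uses random arrays in place of real features. The shape (512, 7, 7) is an assumption, chosen to match what the last convolutional stage of ResNet-18 produces for a 224x224 input; the function name `hadamard_attention` is hypothetical.

```python
import numpy as np

def hadamard_attention(scene_feats, target_feats):
    """Element-wise (Hadamard) product of two feature maps of shape (C, H, W).

    Assumption: both maps have identical shapes, so no broadcasting or
    resizing is needed before multiplying.
    """
    if scene_feats.shape != target_feats.shape:
        raise ValueError("feature maps must have the same shape")
    return scene_feats * target_feats

# Random stand-ins for the scene and target feature maps.
rng = np.random.default_rng(0)
scene_feats = rng.standard_normal((512, 7, 7))
target_feats = rng.standard_normal((512, 7, 7))

# The result keeps one channel per feature channel: a multi-channel heat map.
heat_map = hadamard_attention(scene_feats, target_feats)
```

Because the product is taken per element, the heat map has as many channels as the feature maps; no channel is summed out, unlike in a true dot product.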
For zero-shot methods to work, a network has to learn to "search" different regions in a scene given some target entity (an object, in our case). Hence, we train two networks: a fine-tuned ResNet-18 feature extractor that generates the feature maps, and a bounding box regressor that predicts the bounding box from the heat map. During training, we supervise the bounding box regressor with an L1 bounding box regression loss. This loss is not back-propagated into the fine-tuned network, since we do not want to bias the generated feature maps toward the training instances. Instead, we apply a feature loss to the fine-tuned ResNet-18 network such that the predicted and target object regions lie close to each other in feature space. To do this, we crop the original scene image with the predicted bounding box and extract features from the resulting crop. We then compute the L1 loss between the features of the predicted crop and those of the ground-truth image.
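The crop-and-compare step of the feature loss can be illustrated as follows. This is only a sketch under stated assumptions: `extract_features` is a hypothetical stand-in (per-channel mean and standard deviation) for the fine-tuned ResNet-18, boxes are assumed to be pixel coordinates in (x1, y1, x2, y2) order, and the L1 loss is averaged over feature dimensions. Gradient flow is not modelled here.

```python
import numpy as np

def extract_features(image):
    """Hypothetical stand-in for the feature extractor: per-channel mean and std."""
    return np.concatenate([image.mean(axis=(0, 1)), image.std(axis=(0, 1))])

def crop(image, box):
    """Crop an (H, W, C) image with a pixel box (x1, y1, x2, y2).

    Coordinates are clamped so the crop always contains at least one pixel.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    x1 = int(min(max(x1, 0), w - 1))
    y1 = int(min(max(y1, 0), h - 1))
    x2 = int(min(max(x2, x1 + 1), w))
    y2 = int(min(max(y2, y1 + 1), h))
    return image[y1:y2, x1:x2]

def feature_l1_loss(scene, pred_box, target_image):
    """Mean absolute difference between features of the predicted crop and the target."""
    pred_feats = extract_features(crop(scene, pred_box))
    target_feats = extract_features(target_image)
    return float(np.abs(pred_feats - target_feats).mean())

# Toy scene: a black image containing a white 16x16 "object".
scene = np.zeros((64, 64, 3))
scene[16:32, 8:24] = 1.0
target_image = np.ones((16, 16, 3))

loss_good = feature_l1_loss(scene, (8, 16, 24, 32), target_image)  # box on the object
loss_bad = feature_l1_loss(scene, (0, 0, 16, 16), target_image)    # box on background
```

In this toy setup, a box that covers the object yields a lower feature loss than a box on the background, which is the behaviour the feature loss is meant to reward.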
In our experiments, the bounding box regressor had three fully connected layers with 128, 64 and 4 neurons, respectively. We experimented with having it predict either the raw bounding box coordinates or coordinates normalized by the image size (i.e. between 0 and 1). While this approach was initially able to overfit a small training dataset, it was unable to learn anything meaningful once the dataset was scaled up. We also tried different optimizers (SGD, Adam, Adadelta, Adagrad and RMSProp) and different loss functions (1-IoU and Smooth L1 loss), but the network was unable to learn even on the training set.
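For reference, the coordinate normalization and the three box losses mentioned above can be written out in plain Python. This is a sketch of the standard definitions, not our training code: the (x1, y1, x2, y2) box format, the averaging over the four coordinates, and the Smooth L1 threshold `beta=1.0` are all assumptions.

```python
def normalize_box(box, width, height):
    """Scale pixel coordinates (x1, y1, x2, y2) into the range [0, 1]."""
    x1, y1, x2, y2 = box
    return (x1 / width, y1 / height, x2 / width, y2 / height)

def l1_loss(pred, target):
    """Mean absolute error over the four coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def smooth_l1_loss(pred, target, beta=1.0):
    """Quadratic for small errors (< beta), linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def iou_loss(pred, target):
    """1 - IoU for two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_target = max(0.0, target[2] - target[0]) * max(0.0, target[3] - target[1])
    union = area_pred + area_target - inter
    return 1.0 - (inter / union if union > 0 else 0.0)

# A ground-truth box in a 200x200 image, normalized to [0, 1].
target = normalize_box((50, 50, 150, 150), 200, 200)
near_miss = (0.7, 0.7, 0.9, 0.9)    # overlaps the target slightly
far_miss = (0.8, 0.8, 0.9, 0.9)     # no overlap with the target
```

One property this makes visible is that 1-IoU equals 1 for every non-overlapping pair of boxes, however far apart they are, whereas L1 and Smooth L1 still vary with distance. Whether this contributed to the training failure described above is not something our experiments establish.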
We hypothesized that this approach was failing for the following reasons:
The attention heat map was not providing any meaningful information.
Predicting a single bounding box from scratch is hard and requires a very good loss function to steer learning in the correct direction.
For these reasons, we decided to try two different approaches, Approach #1 and Approach #2, which are detailed below. Approach #1 tries to improve on Approach #0 by providing a better heat map, while Approach #2 explores a different way of predicting the bounding box.