Zero-Shot Object Detection with Attention

Task

Given a scene image containing multiple objects and an image of a target object to search for in the scene, our goal is to draw a bounding box around the target object in the scene. We focus on object detection for robotics tasks and use the YCB Video Dataset. The dataset contains 21 objects and 92 videos with 133,827 frames. For our training dataset, we use frames from one video, each containing a single target. Specifically, we use video 48 for training and video 50 for testing. The training set contains the following objects: master chef can, tuna fish can, mug, large clamp, and extra large clamp. The test set contains the following objects: cracker box, tomato soup can, mustard bottle, banana, and power drill. We chose these two videos because of the variety of shapes and colors across the training and test frames.

As we are concerned only with object detection and not classification, we provide no class labels. Our test dataset contains frames from a separate video sequence containing novel objects that have not been seen during training.

YCB dataset scene images

For our target images, we take the 3D models of the dataset objects and capture different poses of each object. We take 3 elevations and 37 azimuths to get 110 poses per object. From here, we sample 5 random poses and generate a target image from each.
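The pose sampling described above can be sketched as follows. This is only an illustration: the elevation and azimuth values are assumptions, since the text gives only their counts. Note also that a full 3 x 37 grid yields 111 combinations, one more than the 110 poses reported, so the grid actually used presumably differs slightly from this one.

```python
import itertools
import random

# Hypothetical viewing grid: the exact angles are assumptions, not from the text.
ELEVATIONS_DEG = [15.0, 45.0, 75.0]                  # 3 elevations
AZIMUTHS_DEG = [i * 360.0 / 37 for i in range(37)]   # 37 evenly spaced azimuths


def all_poses():
    """Every (elevation, azimuth) pair on the grid."""
    return list(itertools.product(ELEVATIONS_DEG, AZIMUTHS_DEG))


def sample_target_poses(num_samples=5, seed=None):
    """Draw distinct poses at random; each one would be rendered as a target image."""
    rng = random.Random(seed)
    return rng.sample(all_poses(), num_samples)


poses = sample_target_poses(num_samples=5, seed=0)
for elevation, azimuth in poses:
    print(f"elevation={elevation:.1f} deg, azimuth={azimuth:.1f} deg")
```

Sampling without replacement keeps the 5 target views distinct, which matters if the goal is to expose the network to varied orientations of the same object.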

YCB dataset object set

Sample target images of cracker box

Left: scene image from train dataset, Right: scene image from test dataset

We feed both the scene and target images into our object detection methods, which output a predicted bounding box location for the object. Since the poses of objects can vary in a scene, and parts of objects can be occluded, we vary the poses we provide to the networks during training.

Methods

We present one baseline method and two attention-based methods for zero-shot object detection.

Baseline: SIFT Feature Matching

Objects in the videos for our task may be placed anywhere in the 3D space of a scene. As a result, images of the same object may differ in size and orientation between scene images. Since our task is to draw a bounding box around an object in a scene given a target image that may show a different orientation, we decided to use SIFT feature matching for our baseline. SIFT (scale-invariant feature transform) features are image features designed to be invariant to image scale and in-plane rotation.

The method first computes the SIFT keypoints and descriptors of the scene and target image separately. Then, we compare the descriptors of each image and retain the best 50 matches. Using the keypoints corresponding to these descriptors, we compute the homography, then draw a bounding box around the object located in the scene.
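The two steps after descriptor extraction can be sketched in NumPy as below. This is a simplified illustration, not our exact pipeline: it assumes descriptors are already available as arrays and that the homography H has already been estimated. In OpenCV these stages would typically be handled by cv2.SIFT_create, cv2.BFMatcher, and cv2.findHomography. The function names and the (x_min, y_min, x_max, y_max) box format are hypothetical.

```python
import numpy as np


def best_matches(desc_target, desc_scene, keep=50):
    """Brute-force nearest-neighbour matching; returns the `keep` closest pairs.

    desc_target: (Nt, D) descriptors, desc_scene: (Ns, D) descriptors.
    Returns an array of (target_index, scene_index) rows, best match first.
    """
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_t = (desc_target ** 2).sum(axis=1)[:, None]
    sq_s = (desc_scene ** 2).sum(axis=1)[None, :]
    dist2 = np.maximum(sq_t + sq_s - 2.0 * desc_target @ desc_scene.T, 0.0)
    nearest = dist2.argmin(axis=1)                       # best scene match per target descriptor
    nearest_dist = dist2[np.arange(len(desc_target)), nearest]
    order = np.argsort(nearest_dist)[:keep]              # smallest distances first
    return np.stack([order, nearest[order]], axis=1)


def box_from_homography(H, target_width, target_height):
    """Project the target image corners into the scene and enclose them in a box."""
    corners = np.array([[0, 0, 1],
                        [target_width, 0, 1],
                        [target_width, target_height, 1],
                        [0, target_height, 1]], dtype=float)
    projected = corners @ H.T
    projected = projected[:, :2] / projected[:, 2:3]     # back from homogeneous coordinates
    x_min, y_min = projected.min(axis=0)
    x_max, y_max = projected.max(axis=0)
    return x_min, y_min, x_max, y_max
```

Taking the axis-aligned box around the projected corners is one simple choice; when the homography is poorly estimated, the projected quadrilateral can be degenerate and the resulting box meaningless.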

Left: target image, Right: scene image

A limitation of this approach, as we will see in the results, is that if the orientation of the object is completely different, as in the example above, or if there is occlusion in the image, the baseline fails to draw reasonable bounding boxes.

Our Approach #0

Given a scene image and a target object image, we first proposed to extract multi-channel feature maps from a fine-tuned ResNet-18 network pre-trained on ImageNet. Attention would then be applied in this feature space to generate a multi-channel heat map of the scene. We compute attention through a simple element-wise (Hadamard) product between the two feature maps.
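A minimal sketch of this attention step is shown below. It assumes both feature maps have the same (C, H, W) shape, which the element-wise product requires. The 512 x 7 x 7 shape is an assumption corresponding to the last convolutional stage of a ResNet-18 on a 224 x 224 input, and random arrays stand in for real features.

```python
import numpy as np


def hadamard_attention(scene_features, target_features):
    """Element-wise product of two feature maps with identical (C, H, W) shape."""
    if scene_features.shape != target_features.shape:
        raise ValueError("feature maps must share the same shape")
    return scene_features * target_features   # multi-channel heat map, shape (C, H, W)


rng = np.random.default_rng(0)
scene_features = rng.random((512, 7, 7))    # stand-in for ResNet-18 scene features
target_features = rng.random((512, 7, 7))   # stand-in for ResNet-18 target features
heat_map = hadamard_attention(scene_features, target_features)
print(heat_map.shape)
```

Because the product is taken position by position, a response is strong only where the scene and target activations coincide at the same spatial location, which may be one reason the heat map carries little information when the object sits elsewhere in the scene.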

For zero-shot methods to work, a network has to learn to "search" different regions in a scene given some target entity (objects in our case). Hence, we train two networks: a fine-tuned ResNet-18 feature extractor to generate the feature maps, and a bounding box regressor to predict the bounding boxes from the heat map. During training, we supervised the bounding box regressor using an L1 bounding box regression loss. This loss is not back-propagated to the fine-tuned network, since we do not want to bias the generated feature maps towards the training instances. We then apply a feature loss on the fine-tuned ResNet-18 network such that the predicted and target object regions lie close to each other in feature space. To do this, we take the predicted bounding box, crop the original scene image using this bounding box, then extract the features of the resulting predicted image. We then compute the L1 loss between the features of the predicted and ground truth images.
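The two losses can be sketched as follows. This is an illustration under stated assumptions: toy_features is a hypothetical stand-in for the fine-tuned ResNet-18 (it simply averages each color channel), boxes are in pixel (x_min, y_min, x_max, y_max) format, and the reference image is whatever image plays the role of the ground truth, here a crop of the scene.

```python
import numpy as np


def crop(image, box):
    """Crop an (H, W, 3) image to a pixel box (x_min, y_min, x_max, y_max)."""
    height, width = image.shape[:2]
    x_min, y_min, x_max, y_max = (int(round(v)) for v in box)
    # Clamp to the image and keep at least one pixel so the crop is never empty.
    x_min = min(max(x_min, 0), width - 1)
    y_min = min(max(y_min, 0), height - 1)
    x_max = min(max(x_max, x_min + 1), width)
    y_max = min(max(y_max, y_min + 1), height)
    return image[y_min:y_max, x_min:x_max]


def toy_features(patch):
    """Hypothetical stand-in for the feature extractor: per-channel mean."""
    return patch.reshape(-1, patch.shape[-1]).mean(axis=0)


def l1_loss(a, b):
    """Mean absolute difference between two arrays of equal shape."""
    return float(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)).mean())


def training_losses(scene, reference_image, predicted_box, true_box):
    """Return (box regression loss, feature loss) for one example."""
    box_loss = l1_loss(predicted_box, true_box)
    predicted_patch = crop(scene, predicted_box)
    feature_loss = l1_loss(toy_features(predicted_patch), toy_features(reference_image))
    return box_loss, feature_loss
```

In a real training loop the box loss would update only the regressor and the feature loss only the feature extractor, matching the separation described above.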

In our experiments, we used a bounding box regressor with three fully connected layers of 128, 64, and 4 neurons respectively. We experimented with having the bounding box network predict either raw bounding box coordinates or coordinates relative to the size of the image (i.e. between 0 and 1). While this approach was initially able to overfit to a small training dataset, it was unable to learn anything meaningful once we scaled the dataset up. We also experimented with different optimizers, including SGD, Adam, Adadelta, Adagrad, and RMSProp, and with different loss functions, including 1-IOU and Smooth L1 loss. However, the network was unable to learn even on the training set.
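For reference, the relative-coordinate encoding and the two alternative losses mentioned above can be written out as below. This is a sketch that assumes boxes in (x_min, y_min, x_max, y_max) format. The Smooth L1 form with a beta threshold follows the common definition and is not necessarily the exact variant we trained with.

```python
import numpy as np


def to_relative(box, image_width, image_height):
    """Scale pixel coordinates to the [0, 1] range."""
    x_min, y_min, x_max, y_max = box
    return np.array([x_min / image_width, y_min / image_height,
                     x_max / image_width, y_max / image_height])


def iou(box_a, box_b):
    """Intersection over union of two boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, box_a[2] - box_a[0]) * max(0.0, box_a[3] - box_a[1])
    area_b = max(0.0, box_b[2] - box_b[0]) * max(0.0, box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_loss(predicted, target):
    """1 - IoU: zero for a perfect match, one for disjoint boxes."""
    return 1.0 - iou(predicted, target)


def smooth_l1_loss(predicted, target, beta=1.0):
    """Quadratic for small errors, linear for large ones."""
    diff = np.abs(np.asarray(predicted, dtype=float) - np.asarray(target, dtype=float))
    per_coordinate = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(per_coordinate.mean())
```

One property worth noting: 1 - IoU is constant (equal to 1) whenever the predicted and ground truth boxes do not overlap, so on its own it gives no signal about which direction to move a badly misplaced box.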

We hypothesized that this approach was failing for the following reasons:

The attention heat map was not providing any meaningful information.
Predicting a single bounding box from scratch is hard and requires a very good loss function to steer learning in the correct direction.

For these reasons, we decided to try two different approaches, Approach #1 and Approach #2, which are detailed below. Approach #1 tries to improve on Approach #0 by providing a better heat map, and Approach #2 explores a different bounding box prediction approach.

Our Approach #1

To overcome the center bias of approach #0, we applied aggressive data augmentation strategies. The below architecture was used for this approach.

The input scene image is first augmented heavily using the Albumentations library. The target input consists of 15 images of the target object in different orientations. The features for the scene and for all target images are then extracted using backbone networks. Unlike the previous approach, we use a ResNet-50 Feature Pyramid Network to extract the scene features and a VGG-19 network to extract the target features; we observed that a ResNet-18 backbone was not suitable for our task and suffered from poor convergence. The scene features Fs are therefore a multi-scale feature map. The feature maps of the different target images are concatenated along the channel dimension to form Fo, which remains single-scale.

The target feature maps are convolved with the scene feature map at each scale, and the results are upsampled to a common spatial dimension and concatenated along the channel dimension. The concatenated features are then passed to the bounding box regressor g(.), which is a VGG-19 network. The regressor predicts a single bounding box in relative units.

As mentioned before, the crux of this whole model design is the convolution/correlation operation. This operation should generate a spike near the spatial location of the object in the scene. This implies that the features for the object in the target and scene should be similar in the high-dimensional latent space.
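The correlation idea can be illustrated with a small NumPy sketch. It is a toy under several assumptions: random arrays stand in for backbone features, a single target feature map is used instead of the concatenated Fo, the correlation uses no padding, the second pyramid level is made by simple subsampling, and upsampling is nearest-neighbor. The function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def correlate(scene_features, target_features):
    """Valid cross-correlation of a (C, h, w) target over a (C, H, W) scene.

    Returns a response map of shape (H - h + 1, W - w + 1).
    """
    windows = sliding_window_view(scene_features, target_features.shape)[0]
    return np.einsum("ijchw,chw->ij", windows, target_features)


def upsample_nearest(response, out_height, out_width):
    """Nearest-neighbor resize of a 2D response map."""
    rows = np.arange(out_height) * response.shape[0] // out_height
    cols = np.arange(out_width) * response.shape[1] // out_width
    return response[np.ix_(rows, cols)]


def multi_scale_response(scene_pyramid, target_features, out_size):
    """Correlate at every scale, resize to a common size, stack along channels."""
    maps = [upsample_nearest(correlate(level, target_features), *out_size)
            for level in scene_pyramid]
    return np.stack(maps, axis=0)


rng = np.random.default_rng(0)
target = rng.normal(size=(8, 3, 3))
scene = 0.01 * rng.normal(size=(8, 16, 16))
scene[:, 5:8, 9:12] = target                     # plant the object at row 5, column 9
response = correlate(scene, target)
peak = np.unravel_index(response.argmax(), response.shape)
stacked = multi_scale_response([scene, scene[:, ::2, ::2]], target, (14, 14))
print(peak, stacked.shape)
```

When the planted features match the target exactly, the response peaks at the planted location. With real backbones the peak is only as sharp as the similarity between the scene and target features of the same object.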

With this architecture, we observe that the predictions no longer suffer from the center bias.

The above image contains multiple objects of interest. In this case, both the predicted and ground-truth boxes are valid and hence the network predicts one of the bounding boxes corresponding to a whole object in the scene.

Further results are shown in the Results page.

Our Approach #2

We hypothesized that having a network predict a single bounding box is difficult in both a training setting and a zero-shot setting. In a training setting, due to the variety of bounding box sizes and locations around an image, it may be hard for a network to learn what a valid bounding box is (i.e. one whose width and height are not 0, and one that does not exceed the boundaries of the image). Additionally, we found that a large number of bounding boxes in our training dataset were centered in the scene image. As a result, in order to minimize expected loss, Approach #1 without data augmentation was biased towards predicting bounding boxes near the center of images. In a zero-shot setting, due to the potentially major differences in scene colors, lighting, background, and objects, the network may just predict random noise instead of meaningful bounding box coordinates.
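The notion of a valid bounding box used above can be made concrete with a short check. This sketch assumes pixel boxes in (x_min, y_min, x_max, y_max) format. The helper names are hypothetical, and clamp_box is included only to show one way an out-of-bounds prediction could be brought back into the image.

```python
def is_valid_box(box, image_width, image_height):
    """True if the box has non-zero width and height and lies inside the image."""
    x_min, y_min, x_max, y_max = box
    has_area = x_max > x_min and y_max > y_min
    inside = (x_min >= 0 and y_min >= 0
              and x_max <= image_width and y_max <= image_height)
    return has_area and inside


def clamp_box(box, image_width, image_height):
    """Clip each coordinate to the image bounds."""
    x_min, y_min, x_max, y_max = box
    return (min(max(x_min, 0), image_width),
            min(max(y_min, 0), image_height),
            min(max(x_max, 0), image_width),
            min(max(y_max, 0), image_height))
```

Clamping fixes out-of-bounds coordinates but not degenerate ones: a box that lies entirely outside the image collapses to zero area after clamping, so both checks are needed.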

To address this, we reasoned that a pre-trained network, despite not seeing an object before, should still be able to understand what is an object versus background, and propose reasonable boxes. So, we decided to use a pre-trained region proposal network to predict multiple boxes, then feed these predictions into a bounding box regressor to hopefully provide extra support in predicting a final bounding box. The architecture is illustrated below.

We use a Faster R-CNN network pre-trained on ImageNet to extract bounding box proposals of objects from the scene image, along with their corresponding feature maps. Since Faster R-CNN uses a ResNet-50 backbone, we use a ResNet-50 encoder pre-trained on ImageNet to extract our target features. The per-box RPN features have dimensions 16x2000x7x7, so we apply global average pooling to the feature maps to obtain a 16x2000 matrix. After normalizing both the bounding box features and the target image features, we compute the dot product between them to obtain an attention map. We then apply a softmax to the attention map to obtain a "ranking" of which features best match the target image features. Finally, we concatenate these "scores" with the bounding box proposal coordinates and feed the result to a bounding box regressor containing 4 linear layers with 4069, 4096, 2048, and 4 neurons respectively. Our intuition behind concatenating the proposal coordinates and the attention scores is that the bounding box regressor can learn to draw information from the highest scoring proposal.
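The scoring step can be sketched as follows for a single image. The shapes here are assumptions chosen for readability (N proposals, each with a C x 7 x 7 feature map, and a C-dimensional target feature vector) and do not reproduce the batched dimensions quoted above. Random arrays stand in for real features, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def score_proposals(box_features, target_features, boxes):
    """Rank proposals against the target and build the regressor input.

    box_features: (N, C, 7, 7), target_features: (C,), boxes: (N, 4).
    Returns (scores of shape (N,), regressor input of shape (5 * N,)).
    """
    pooled = box_features.mean(axis=(2, 3))                        # global average pooling -> (N, C)
    similarity = l2_normalize(pooled) @ l2_normalize(target_features)
    scores = softmax(similarity)                                    # "ranking" over proposals
    regressor_input = np.concatenate([scores, boxes.reshape(-1)])  # scores followed by coordinates
    return scores, regressor_input


rng = np.random.default_rng(0)
num_proposals, channels = 6, 32
box_features = rng.normal(size=(num_proposals, channels, 7, 7))
target_features = rng.normal(size=channels)
box_features[3] = target_features[:, None, None]   # make proposal 3 match the target exactly
boxes = rng.random((num_proposals, 4))
scores, regressor_input = score_proposals(box_features, target_features, boxes)
print(scores.argmax(), regressor_input.shape)
```

Because cosine similarities lie in [-1, 1], the softmax over them is fairly flat; a temperature parameter would sharpen the ranking, though the text does not mention using one.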

During training, we once again experimented with the same optimizers as in Approach #0 and used L1 loss. Gradients were back-propagated through both the ResNet-50 encoder and the Faster R-CNN network.

We find that this method is able to learn, during training, to predict reasonable bounding boxes all around the image. For example, below we have a prediction on the training set for the master chef can.

Unfortunately, while the model was able to produce bounding boxes like the one above on the training set, it was unable to perform at all on the test set. Bounding box coordinates were mostly invalid, with the bounding box x and y coordinates outside the image. As a result, all of the predictions looked like the ones below.

Bounding box predictions on the test set. Note that all the predictions are in the bottom left corner due to the coordinates being invalid or out of the bounds of the image.

While the network detailed above was able to obtain reasonable training performance, it was unable to perform in a zero-shot setting. We reasoned that this is because the network was still unable to learn what a bounding box should look like and was instead simply overfitting to the training set.

Drawing upon the idea that a pre-trained region proposal network should still be able to generate reasonable bounding boxes, we decided to change our task from a regression to a classification one. Instead of predicting bounding box coordinates (or the top left coordinates and width and height), we would instead predict which of the bounding box proposals best matches the ground truth bounding box. We would back-propagate the loss through the selected bounding box and the region proposal network to fine-tune the proposals. The updated architecture is illustrated below.

The architecture is the same as the previous network; however, the bounding box regressor now predicts a scalar which represents the index of the proposal to select as the final bounding box prediction.
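One way to set up this classification task is sketched below. It rests on assumptions not stated in the text: the training label is taken to be the proposal with the highest IoU against the ground truth box, the network is assumed to output one logit per proposal, and cross-entropy is used as the loss. Boxes are in (x_min, y_min, x_max, y_max) format and the function names are hypothetical.

```python
import numpy as np


def iou_with_proposals(proposals, gt_box):
    """IoU between each of N proposals (N, 4) and one ground truth box (4,)."""
    proposals = np.asarray(proposals, dtype=float)
    gt = np.asarray(gt_box, dtype=float)
    inter_w = np.clip(np.minimum(proposals[:, 2], gt[2]) - np.maximum(proposals[:, 0], gt[0]), 0, None)
    inter_h = np.clip(np.minimum(proposals[:, 3], gt[3]) - np.maximum(proposals[:, 1], gt[1]), 0, None)
    inter = inter_w * inter_h
    area_p = (proposals[:, 2] - proposals[:, 0]) * (proposals[:, 3] - proposals[:, 1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / np.maximum(area_p + area_gt - inter, 1e-12)


def classification_target(proposals, gt_box):
    """Index of the proposal that best overlaps the ground truth."""
    return int(iou_with_proposals(proposals, gt_box).argmax())


def cross_entropy(logits, label):
    """Negative log-probability of the labelled proposal."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])


def select_box(proposals, logits):
    """Final prediction: the proposal with the highest logit."""
    return np.asarray(proposals)[int(np.argmax(logits))]
```

Since the output is always one of the proposals, the prediction inherits their validity: it cannot fall outside the image or have zero size unless a proposal does.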

While the predicted bounding boxes on the training set may not be as close to the ground truth as those of the previously proposed network, this revised version avoids proposing bounding boxes outside of the image, or with invalid widths and heights, on the test set. We detail our results for this approach in the Results page.
