Evaluation
To compare our methods, we present both qualitative and quantitative results. For quantitative results, we compute the average IOU of each method over the test set.
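Concretely, the IOU of two boxes is the area of their intersection divided by the area of their union. A minimal sketch of the metric (assuming axis-aligned boxes in (x1, y1, x2, y2) format, which is an assumption about our box convention):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# The reported score is the mean of per-image IOUs over the test set.
```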
We find that our baseline method achieves an average IOU of 0.06 over the test set. While the method works reasonably well in cases where the orientation of the target object directly matches that of the object in the scene, any occlusion or difference in orientation causes it to fail to draw a correct bounding box.
An example of a successful bounding box on the test set is shown below. We can see that the orientations of the object match in the scene and target images, and just enough distinct features of the can are visible to draw a correct bounding box.
Another successful case is shown for the cracker box. Here, SIFT feature matching found a reasonable bounding box, but may have been lucky: the features of the Cheez-It font match, just not on the correct side of the box. Nonetheless, because some of the matched features lie on different sides of the box in the target image, the homography still recovers the proper orientation of the box.
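A rough sketch of this baseline pipeline using OpenCV (the matcher settings and ratio threshold here are illustrative assumptions, not necessarily our exact configuration):

```python
import cv2
import numpy as np

def baseline_bbox(target_gray, scene_gray):
    """Match SIFT features and project the target image's corners into the scene."""
    sift = cv2.SIFT_create()
    kp_t, des_t = sift.detectAndCompute(target_gray, None)
    kp_s, des_s = sift.detectAndCompute(scene_gray, None)

    # Lowe's ratio test on k-nearest-neighbour matches
    matcher = cv2.BFMatcher()
    matches = matcher.knnMatch(des_t, des_s, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    if len(good) < 4:
        return None  # not enough matches to estimate a homography

    src = np.float32([kp_t[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_s[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None

    # Project the target image's corners into the scene and take their bounding box
    h, w = target_gray.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    projected = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
    x1, y1 = projected.min(axis=0)
    x2, y2 = projected.max(axis=0)
    return x1, y1, x2, y2
```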
However, as mentioned before, orientation and occlusion affect the baseline method's robustness. In the example below, since the top of the cracker box is not visible in the target image, the method cannot tell that the box is lying on its side rather than facing up, and it draws too small a bounding box.
Additionally, more often than not, the method fails to find matching features and the resulting bounding boxes are not meaningful. We found that the method struggled the most on the drill, possibly due to the drill's darker lighting in the target image.
Left and Center: bounding box predictions for a mustard bottle, Right: bounding box predictions for a drill
The approach described as 'Approach #1' on the 'Our Idea' page is trained using Smooth L1 loss and the Adam optimizer. Through experiments, we found that this combination of loss and optimizer provides the most stable training for the network.
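The training loop follows the standard PyTorch pattern; a sketch is below (the learning rate and the `model`/`train_loader` names are placeholders, not our exact values):

```python
import torch

# model maps a (scene, posed target images) pair to a 4-vector bounding box prediction
criterion = torch.nn.SmoothL1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is illustrative

for scene, targets, gt_box in train_loader:
    optimizer.zero_grad()
    pred_box = model(scene, targets)
    loss = criterion(pred_box, gt_box)
    loss.backward()
    optimizer.step()
```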
Below we showcase representative successful results on the train (left) and test (right) sets. Note that the test sequence is an entirely new scene with previously unseen objects and orientations.
Results on train set
Results on test set
Our model suffers from a few failure modes as well, shown below. Most of them occur either because the target object is completely occluded in the scene image or because the target object is posed in an orientation very different from those covered by the set of posed target images given as input to the network. This is an inherent limitation of the model, since only a handful of target images (limited by hardware) can be given to the network, and pose information is not utilized.
Failure cases on train set
Failure cases on test set
We train the network described in 'Our Idea Approach #2' using L1 loss plus the region proposal network's objectness and bounding-box regression losses. We found the Adadelta optimizer gave us the most efficient training. We use a step learning rate scheduler with an initial learning rate of 1e-2, a step size of 1 epoch, and a decay of 0.4. We also experimented with adding 1 - IOU to the loss function; however, we found L1 loss alone to be more effective. While we trained the network for 19 epochs, we found that only 1 epoch of fine-tuning achieves the same downstream performance.
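In PyTorch terms, the optimizer, scheduler, and combined loss look roughly like this (a sketch; `rpn_losses` stands in for the region proposal network's objectness and box-regression losses, and the variable names are placeholders):

```python
import torch

optimizer = torch.optim.Adadelta(model.parameters(), lr=1e-2)
# The step size is measured in epochs, so the scheduler is stepped once per epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.4)

for epoch in range(num_epochs):
    for scene, targets, gt_box in train_loader:
        optimizer.zero_grad()
        pred_box, rpn_losses = model(scene, targets, gt_box)
        loss = torch.nn.functional.l1_loss(pred_box, gt_box) + sum(rpn_losses.values())
        loss.backward()
        optimizer.step()
    scheduler.step()
```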
We tried two approaches to indexing into the list of bounding box proposals (sketched after the list):
Have the network predict a raw index (i.e. 0, 1, 2, ..., 1999)
Have the network predict a number between 0 and 1 and multiply the output by the number of proposals minus 1 (i.e., 1999, since indexing starts at 0)
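Both schemes reduce to a small decoding step at inference time. A minimal sketch (the rounding and clipping below are our own illustrative choices for obtaining a valid lookup index, not necessarily the exact implementation):

```python
def decode_index(raw_output, num_proposals=2000, scaled=False):
    """Map the network's scalar output to an index into the proposal list."""
    if scaled:
        # Approach 2: the output is (ideally) in [0, 1]; scale by num_proposals - 1.
        idx = float(raw_output) * (num_proposals - 1)
    else:
        # Approach 1: the output is treated directly as a raw index.
        idx = float(raw_output)
    # Round and clip only to obtain a valid lookup index at inference time.
    return max(0, min(num_proposals - 1, int(round(idx))))
```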
With the first approach of predicting a raw index, we found that the network achieved an IOU of 0.22, a significant jump over the baseline's 0.06. However, upon further investigation of what the network was predicting, we found that it chose index 0 every time. Consequently, during training the network was really only optimizing the bounding box proposals for the first index. While this can produce very good bounding boxes, at test time it resulted in only mustard bottles being predicted.
When the target object is a mustard bottle, we are able to predict the correct bounding box from all angles.
When the target object is not a mustard bottle, the network still predicts the target object is a mustard bottle.
While bounding boxes for the unseen mustard bottle were accurate, the network's inability to predict other objects limits its utility. We hypothesized that consistently predicting index 0 during training limits the network's expressive power to pick a better index at test time.
This led us to the second approach of predicting a number between 0 and 1 and then scaling the output to get a final index. To keep the network's predictions between 0 and 1, we avoided clamping the output or applying a sigmoid. Clamping would not give the network any signal that an out-of-bounds prediction was bad, and a sigmoid may saturate gradients; moreover, with a training batch size greater than 1, the output would not be accurate when using a batch size of 1 on the test dataset. Instead, we added to the loss the distance of the predicted index from 0 if the index was negative, or its distance from 1 if the index was greater than 1.
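The out-of-range term can be expressed as a simple additional loss; a sketch of the idea (the reduction and weighting are assumptions):

```python
import torch

def range_penalty(pred_index):
    """Penalty for predictions outside [0, 1]: distance below 0 plus distance above 1."""
    below = torch.clamp(-pred_index, min=0.0)       # how far the prediction falls below 0
    above = torch.clamp(pred_index - 1.0, min=0.0)  # how far the prediction exceeds 1
    return (below + above).mean()

# Total loss (conceptually): L1 box loss + RPN losses + range_penalty(pred_index)
```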
The resulting network achieved an IOU of 0.08, an improvement over the baseline method's 0.06. Looking at qualitative results, we see that the network is now able to predict boxes for all objects. The low IOU, however, reflects the fact that the network's predictions are imprecise in most cases. Below are visualizations of the results.
When the network correctly predicts the general location of the target object, the IOU between the ground truth and the prediction is high.
However, when the network is unable to locate the target object, bounding box predictions are either of the wrong object, or completely off.
Examples of bounding boxes around the wrong object.
Examples of bounding boxes that are far from actual objects.
One reason for the poor performance on some images may be the format of the target image we feed into the network. For example, below are the scene and target images of a successful prediction.
Left: scene image, Right: target image
In our task, we crop the target image and resize it to 224x224 to fit the input size of the ResNet50 encoder. Although neither the orientation nor the shape of the can matches that of the one in the scene, the network is still able to discern distinct features and draw on training examples of what a can (albeit of a different color, brand, etc.) looks like. However, for the failure case below, the target image may not be as descriptive to the network because of similar-looking objects in the scene.
Left: scene image, Right: target image
The network predicts that the side of the mustard bottle, which is curved and has a reflection near the end, is where the banana is located. This is an understandable mistake, as the side of the bottle has the same qualities as the target image. If, however, we provided a 3D object mesh instead of a single 2D image, perhaps the network would have been able to differentiate between the two objects. We propose using meshes instead of 2D target images as one direction for future work.