For our project we explored zero-shot object detection, in the setting of a robot arm's vision system that must detect a novel object in a scene given only an image of the target. We used the YCB dataset and defined the task as follows: given image frames from a video, target object images, and bounding box labels as training data, predict bounding boxes around a novel object in image frames from another video, given only a target image of that object.
As a baseline we employed SIFT feature matching. This non-deep method provides good results in easy cases, but it fails when the object in the scene appears in a different orientation than the target image or when it is occluded.
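For reference, the following is a minimal sketch of such a SIFT-matching baseline with OpenCV; the ratio-test threshold, the minimum match count, and the homography-based box extraction are our illustrative choices rather than a fixed recipe.

```python
import cv2
import numpy as np

def sift_match_bbox(scene_bgr, target_bgr, min_matches=10):
    """Locate the target in the scene by matching SIFT keypoints and fitting
    a homography; returns an (x0, y0, x1, y1) box or None if matching fails."""
    gray_s = cv2.cvtColor(scene_bgr, cv2.COLOR_BGR2GRAY)
    gray_t = cv2.cvtColor(target_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    kp_t, des_t = sift.detectAndCompute(gray_t, None)
    kp_s, des_s = sift.detectAndCompute(gray_s, None)
    if des_t is None or des_s is None:
        return None

    # Lowe's ratio test on k-nearest-neighbour matches.
    matcher = cv2.BFMatcher()
    pairs = matcher.knnMatch(des_t, des_s, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < min_matches:
        return None

    # Estimate a homography from target to scene and project the target corners.
    src = np.float32([kp_t[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_s[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None
    h, w = gray_t.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
    x0, y0 = warped.min(axis=0)
    x1, y1 = warped.max(axis=0)
    return int(x0), int(y0), int(x1), int(y1)
```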
In Approach #0, we built a deep neural network consisting of a feature net and a bounding box net, and applied a simple dot product between the scene feature and the target feature to compute an attention heat map. The network was able to overfit a small training set, but it failed to generalize to a larger one. We reasoned that this is because the attention map does not provide useful information, and/or because predicting a bounding box on the scene image from scratch is difficult.
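The attention step can be sketched roughly as below (PyTorch); the tensor shapes and the choice of average-pooling the target feature into a single vector are illustrative assumptions, not the exact layers we used.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(scene_feat, target_feat):
    """scene_feat: (B, C, H, W) feature map of the scene from the feature net.
    target_feat: (B, C, h, w) feature map of the target image.
    Returns a (B, 1, H, W) heat map of scene-target similarity."""
    # Collapse the target feature map into a single C-dim descriptor (assumed pooling).
    target_vec = F.adaptive_avg_pool2d(target_feat, 1)           # (B, C, 1, 1)
    # Dot product at every spatial location of the scene feature map.
    heat = (scene_feat * target_vec).sum(dim=1, keepdim=True)    # (B, 1, H, W)
    # Normalize over spatial locations before feeding the bounding box net.
    heat = torch.softmax(heat.flatten(2), dim=-1).view_as(heat)
    return heat
```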
In Approach #1, to reduce the center bias we observed in the bounding boxes predicted by Approach #0, we applied aggressive data augmentation and produced the spatial heat map with a convolution operation. This removed the center bias and achieved good results on some test scenes. However, the model suffers when the object appears in a pose not covered by the poses we sampled when generating target images, or when the object is occluded.
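A rough sketch of this correlation step is shown below, where each target feature map is used as a convolution kernel over its own scene feature map; the grouped-convolution trick and the final resizing are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def correlation_heatmap(scene_feat, target_feat):
    """scene_feat: (B, C, H, W), target_feat: (B, C, h, w).
    Convolves each target feature map over its own scene feature map and
    returns a (B, 1, H, W) spatial heat map."""
    B, C, H, W = scene_feat.shape
    _, _, h, w = target_feat.shape
    # Treat the batch as groups so each scene is correlated with its own target.
    scene = scene_feat.reshape(1, B * C, H, W)
    kernel = target_feat.reshape(B, C, h, w)
    heat = F.conv2d(scene, kernel, groups=B, padding=(h // 2, w // 2))  # (1, B, H', W')
    heat = heat.reshape(B, 1, heat.shape[-2], heat.shape[-1])
    # Resize back to the scene resolution in case padding changed the size.
    heat = F.interpolate(heat, size=(H, W), mode='bilinear', align_corners=False)
    return heat
```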
In Approach #2, we employed a pre-trained region proposal network to suggest meaningful bounding boxes on the scene image. The model predicts an index into the proposals based on matching scores computed between the features of each cropped region and the target image. We tried having the model predict a raw proposal index, as well as predict a number between 0 and 1 that is scaled to an integer index. The resulting network achieved an IOU of 0.08, an improvement over the baseline. We observe that when the network predicts the general location of the target object, the IOU between the prediction and the ground truth is high; when it fails to locate the target object, the prediction tends to be far off. We reasoned that this could be because the target image is not descriptive enough to distinguish similar objects, so extra supervision may be needed.
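The proposal-scoring step can be sketched as follows; the use of RoIAlign, average pooling, and cosine similarity here are illustrative assumptions about how region and target features are compared, not the exact heads we trained.

```python
import torch
import torch.nn.functional as F
import torchvision

def score_proposals(scene_feat, proposals, target_vec, image_size, crop=7):
    """scene_feat: (1, C, H, W) backbone feature map of the scene.
    proposals: (N, 4) boxes in image coordinates from the pretrained RPN.
    target_vec: (C,) pooled feature of the target image.
    image_size: (height, width) of the input scene image.
    Returns the index of the best-matching proposal and all matching scores."""
    # Pool a fixed-size feature for every proposed region.
    spatial_scale = scene_feat.shape[-1] / image_size[1]
    region_feats = torchvision.ops.roi_align(
        scene_feat, [proposals], output_size=crop, spatial_scale=spatial_scale)
    region_vecs = region_feats.mean(dim=(2, 3))                  # (N, C)
    # Cosine similarity between each region descriptor and the target descriptor.
    scores = F.cosine_similarity(region_vecs, target_vec.unsqueeze(0), dim=1)
    return scores.argmax().item(), scores
```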
Based on the experiments we have run with our current approaches, here are our ideas for future exploration to improve results.
In addition to bounding box labels, the YCB dataset that we used for this project also provides segmentation labels for each object. One future direction is to add extra supervision by using these segmentation labels and predicting a mask for the novel object in a novel scene; a sketch of such a mask head is given below. The segmentation labels would provide more information about the object's shape, orientation, and occlusion. Since we reasoned that one cause of our failure cases is the variety of poses in which objects appear in the scene, segmentation labels would hopefully help alleviate the issue.
Figure: an example RGB image and its corresponding segmentation label from the YCB dataset.
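A minimal sketch of what such mask supervision could look like is shown below; the head architecture and the binary cross-entropy loss are assumptions for illustration, not a finalized design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskHead(nn.Module):
    """Illustrative mask head: turns the fused scene/target feature map into a
    per-pixel foreground logit for the target object."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 1, 1)

    def forward(self, fused_feat, out_size):
        x = F.relu(self.conv1(fused_feat))
        logits = self.conv2(x)
        # Predict the mask at the full image resolution.
        return F.interpolate(logits, size=out_size, mode='bilinear',
                             align_corners=False)

def mask_loss(pred_logits, gt_mask):
    """Binary cross-entropy against the YCB segmentation label for the target
    object; gt_mask: (B, 1, H, W) with values in {0, 1}."""
    return F.binary_cross_entropy_with_logits(pred_logits, gt_mask.float())
```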
Inspired by the same idea, we could also directly use the pose parameters provided by the YCB dataset, which include a transformation matrix for each object of interest appearing in the scene. For now, during training and testing we pair each scene image with a target image generated from the 3D mesh at uniformly sampled poses. Using the pose labels as extra supervision would hopefully yield better attention maps when matching scene features with target features; a sketch of an auxiliary pose head is given below.
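As a rough illustration, an auxiliary head could regress the flattened transformation matrix alongside the existing losses; the head architecture and loss choice below are assumptions, not a finalized design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseHead(nn.Module):
    """Illustrative auxiliary head: regresses the target object's pose
    (here a flattened 3x4 transformation matrix) from the fused feature map."""
    def __init__(self, in_channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_channels, 256), nn.ReLU(),
            nn.Linear(256, 12))  # 3x4 pose matrix, flattened

    def forward(self, fused_feat):
        return self.fc(fused_feat)

def pose_loss(pred_pose, gt_pose):
    """Smooth L1 loss against the YCB pose label; gt_pose: (B, 12),
    a flattened per-object transformation matrix."""
    return F.smooth_l1_loss(pred_pose, gt_pose)
```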
As seen in the example from Approach #2, different objects (such as the banana and the mustard bottle) can look similar and share features under certain orientations and positions in the scene. This is inevitable when we only have 2D information about the target object. In future work, we could directly use the 3D object meshes that come with the YCB dataset to provide more information about the target objects. Although in real-world scenarios target object information in the form of a 3D mesh may be less available than 2D images, it would be interesting to see whether we can achieve better results with 3D meshes in this zero-shot detection task.