16-824 Visual Learning and Recognition Project
https://sites.google.com/andrew.cmu.edu/vlr-team11-zero-shot-detection/
For real-world tasks, computer vision systems must be able to reason about their environment beyond what their underlying models were trained on. Consider a human warehouse worker who picks items off a conveyor belt and places them into their corresponding packages, given only a single target image of the object of interest. Even if the worker has never seen the object before, they can still identify it on the conveyor belt and pick it up correctly. An object recognition system for a warehouse robot should be able to perform the same task. We frame this task as zero-shot object detection: we seek to train models that can locate a target object they have never seen before, given only an image of it at test time. In computer vision, attention has led to large performance gains in reasoning about images as a whole and across time. Much prior work has focused on textual descriptions of the target object; we instead provide a visual description and use attention to help the model identify salient features shared between the target image and the camera view.
Factory robots often need to grasp objects from a pile without a priori knowledge of which object is to be picked.
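To make the attention idea concrete, here is a minimal NumPy sketch of scaled dot-product cross-attention between a single target-image embedding (the query) and per-location scene features (the keys). This is an illustrative toy, not the project's actual model: the function name, feature dimensions, and the assumption that features come from some pretrained extractor are all hypothetical.

```python
import numpy as np

def cross_attention_saliency(target_feat, scene_feats):
    """Score each scene location by scaled dot-product attention
    between a target-image embedding and per-location scene features.

    target_feat: (d,) embedding of the target image (query)
    scene_feats: (n, d) embeddings of n scene locations (keys)
    Returns a softmax-normalized saliency map over the n locations.
    """
    d = target_feat.shape[0]
    # Scaled dot-product scores: one query against n keys.
    scores = scene_feats @ target_feat / np.sqrt(d)
    # Softmax over scene locations (shifted for numerical stability).
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

# Toy example: 4 scene locations with 3-d features; by construction,
# location 2 has the feature vector most aligned with the target.
target = np.array([1.0, 0.0, 0.0])
scene = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])
saliency = cross_attention_saliency(target, scene)
print(int(saliency.argmax()))  # index of the most salient location
```

In a full detector, the query would come from a feature extractor applied to the target image and the keys from a spatial feature map of the camera view, with the resulting saliency guiding box proposals.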