Task

Given a scene image containing multiple objects and an image of a target object to search for, our goal is to draw a bounding box around the target object in the scene. We focus on object detection for robotics tasks and use the YCB Video Dataset, which contains 21 objects and 92 videos with 133,827 frames in total. We draw our training and test data from one video each, pairing every frame with a single target object: video 48 is used for training and video 50 for testing. The training video contains the following objects: master chef can, tuna fish can, mug, large clamp, and extra large clamp. The test video contains the following objects: cracker box, tomato soup can, mustard bottle, banana, and power drill. We chose these two videos because the objects in the training frames differ widely in shape and color from those in the test frames.
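As an illustration, the split described above can be written down as a small configuration. This is only a sketch: the constant names and the helper `split_is_disjoint` are hypothetical, and the object names are copied from the text rather than taken from the dataset's own identifiers.

```python
# Hypothetical description of the train/test split; not the authors' code.
TRAIN_VIDEO = 48
TEST_VIDEO = 50

# Object names as listed in the text (assumed spelling, not dataset IDs).
TRAIN_OBJECTS = frozenset({
    "master chef can",
    "tuna fish can",
    "mug",
    "large clamp",
    "extra large clamp",
})
TEST_OBJECTS = frozenset({
    "cracker box",
    "tomato soup can",
    "mustard bottle",
    "banana",
    "power drill",
})


def split_is_disjoint(train_objects, test_objects):
    """Return True if no object appears in both the training and test sets."""
    return train_objects.isdisjoint(test_objects)
```

Calling `split_is_disjoint(TRAIN_OBJECTS, TEST_OBJECTS)` is a cheap sanity check that the two videos share no objects.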

As we are concerned only with object detection and not classification, we provide no class labels. Our test dataset consists of frames from a separate video sequence whose objects are novel, that is, not seen during training.
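One way to picture a class-agnostic example is as a record holding a scene image, a target image, and a ground-truth box, with no label field. The sketch below is an assumption for illustration only: the name `DetectionSample`, the `(x_min, y_min, x_max, y_max)` pixel box convention, and the validation rule are not specified in the text.

```python
from dataclasses import dataclass
from typing import Any, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class DetectionSample:
    """Hypothetical class-agnostic sample: it deliberately has no class label."""

    scene: Any   # scene image containing multiple objects
    target: Any  # image of the object to search for in the scene
    box: Box     # ground-truth bounding box of the target in the scene

    def __post_init__(self):
        x_min, y_min, x_max, y_max = self.box
        # Reject degenerate boxes with zero or negative width or height.
        if not (x_min < x_max and y_min < y_max):
            raise ValueError("box must have positive width and height")
```

Because the record carries only images and a box, the same structure can hold samples for objects that never appeared in training.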