Dataset Links

Due to space limitations, we have created many versions of the full dataset that can be downloaded separately. Please download whichever best suits your needs. All files are in the .rar format.
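
If you prefer to script the extraction, below is a minimal sketch using the rarfile Python package; the archive name is a placeholder for whichever file you download.

```python
# Minimal extraction sketch using the rarfile package.
# Install with `pip install rarfile`; it also needs a system
# unrar backend available on the PATH.
import rarfile

# Placeholder name -- substitute the .rar file you actually downloaded.
with rarfile.RarFile("wisdom-real.rar") as archive:
    archive.extractall(path="wisdom")  # unpack into ./wisdom
```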

Information on dataset structure can be found in the README.txt file.

Terms of Use

The images and annotations in this dataset, along with this website, belong to UC Berkeley and are licensed under a Creative Commons Attribution 4.0 License.

WISDOM Dataset Objectives

The objectives of introducing the Warehouse Instance Segmentation Dataset for Object Manipulation (WISDOM) are three-fold: (1) it aims to be more applicable to manipulation and warehousing tasks such as grasping or placement, as opposed to existing instance segmentation datasets that consist of natural scenes; (2) it motivates training an instance segmentation network exclusively on simulated depth data, which is easy to generate and realistic; and (3) it tests generalization to real images and to different sensor models. The dataset contains a total of 50,000 synthetic depth images and 800 real color and depth images of heaps of objects in a bin, each labelled with segmentation masks for every object. The synthetic depth images are rendered with camera intrinsics proportional to those of the Photoneo PhoXi, a high-resolution depth sensor. The real depth images are taken with both the Photoneo PhoXi and a Primesense Carmine, a lower-resolution depth sensor, and the real color images are taken with both the Primesense camera and a high-resolution webcam.
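
For intuition, "camera intrinsics proportional to" a physical sensor can be read through the standard pinhole model: resizing the image by a factor s scales the focal lengths and principal point by s while leaving the projection otherwise unchanged. The sketch below illustrates this; the numeric values are placeholders, not the PhoXi's actual calibration.

```python
import numpy as np

def scale_intrinsics(K, s):
    """Scale a 3x3 pinhole intrinsics matrix for an image resized by
    factor s in both dimensions: focal lengths (fx, fy) and principal
    point (cx, cy) scale linearly; the bottom row stays [0, 0, 1]."""
    K_scaled = K.copy()
    K_scaled[:2, :] *= s
    return K_scaled

# Illustrative values only -- NOT the PhoXi's real calibration.
K_full = np.array([[1100.0,    0.0, 640.0],
                   [   0.0, 1100.0, 480.0],
                   [   0.0,    0.0,   1.0]])
K_half = scale_intrinsics(K_full, 0.5)  # intrinsics for a half-resolution render
```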

Dataset Statistics

The WISDOM-Sim dataset contains 50,000 synthetic depth images and 320,000 individual ground-truth instance segmentation masks generated from 1,600 Thingiverse, KIT, 3DNet, and "packaged" objects. The "packaged" objects are objects from the dataset augmented with a cardboard backing to mimic common packages. The synthetic images are split into 40,000 training images, which contain 1,280 objects, and 10,000 test images, which contain the remaining 320 objects. The WISDOM-Real dataset consists of 800 RGB-D images of 400 unique heaps containing 3,849 instances of 50 real physical objects (25 training objects and 25 test objects), all commonly found in the household and readily available on Amazon. The 400 heaps are split into 100 training heaps and 300 test heaps (containing the training and test objects, respectively), each of which has low-res and high-res color and depth images.
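
As a quick-start, here is a minimal loading sketch. The file names and directory layout are hypothetical (consult README.txt for the actual structure); it assumes depth images stored as NumPy arrays and a per-image stack of binary instance masks.

```python
import numpy as np

# Hypothetical paths -- see README.txt for the real dataset layout.
depth = np.load("wisdom-sim/depth_ims/image_000000.npy")    # H x W float depth map
masks = np.load("wisdom-sim/modal_masks/image_000000.npy")  # N x H x W binary masks

num_instances = masks.shape[0]  # one ground-truth mask per object in the heap
print(f"{num_instances} instances in this heap")
```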

The real dataset has an average of 4.8 object instances per image, fewer than the 7.7 instances per image in the Common Objects in Context (COCO) dataset and the 6.5 instances per image in WISDOM-Sim, but more than both ImageNet and PASCAL VOC (3.0 and 2.3, respectively). Additionally, it has many more instances that are close to, overlapping with, or occluding other instances, making it a more representative dataset for tasks such as bin picking in cluttered environments. Since the dataset is designed for manipulation tasks, most of its objects are much smaller in area than those in COCO, which aims to distribute instance areas more evenly. In the WISDOM dataset, instances take up on average only 2.28% of the total image area in the simulated images, 1.60% of the total image area in the high-res images, and 0.94% of the total image area in the low-res images. The figure above compares the distributions of these metrics to those of the COCO dataset.
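
For concreteness, the two statistics above (mean instances per image and mean instance area as a fraction of image area) can be computed from per-image stacks of boolean masks roughly as follows; the mask format is the same assumption as in the loading sketch above.

```python
import numpy as np

def dataset_stats(mask_stacks):
    """Mean instances per image and mean per-instance area fraction,
    given an iterable of boolean mask stacks shaped (N, H, W)."""
    counts, area_fracs = [], []
    for masks in mask_stacks:
        n, h, w = masks.shape
        counts.append(n)
        # fraction of the image each instance covers (mean of booleans)
        area_fracs.extend(masks.reshape(n, -1).mean(axis=1))
    return float(np.mean(counts)), float(np.mean(area_fracs))
```

Applied to the high-res real images, these should land near the 4.8 instances per image and 1.60% area figures quoted above.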