COB-3D-v2 Dataset Format

This page describes the format of the COB-3D-v2 dataset.


The tar file contains the following directory structure:

v2/

  |-- dset.json

  |-- scenes/

            |-- {scene_id}.npz

  |-- meshes/

            |-- {mesh_id}.stl



The dataset split is contained in v2/dset.json. This is a JSON file in the format:

  {

    "train": [

      "scene_id_1",

      "scene_id_2",

      ...

    ],

    "val": [

      "scene_id_X",

      ...

    ]

  }


There are 6955 scenes in total (6259 train, 696 val). 
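
Reading the split can be sketched as follows. This is a minimal example, assuming the file is valid JSON with exactly the "train" and "val" keys shown above; the helper names `load_split` and `scene_path` are hypothetical and not part of the dataset release.

```python
import json
from pathlib import Path


def load_split(dset_json_path):
    """Read the split file and return {"train": [...], "val": [...]}."""
    with open(dset_json_path, "r") as f:
        split = json.load(f)
    # Keep only the two split names described on this page.
    return {name: list(split[name]) for name in ("train", "val")}


def scene_path(root, scene_id):
    """Build the NPZ path for a scene, following the {root}/scenes/{scene_id}.npz layout."""
    return Path(root) / "scenes" / f"{scene_id}.npz"
```

For example, `load_split("v2/dset.json")["train"]` would give the list of training scene ids, and `scene_path("v2", scene_id)` the matching NPZ file. If the extracted data matches the counts above, the two lists should have 6259 and 696 entries.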

Each scene is contained in an NPZ file: v2/scenes/{scene_id}.npz

The structure of each NPZ file is outlined below. The `mesh_ids`, `obj_poses`, and `voxel_grid` fields were added after the v1 release to enable training shape completion models.
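
A quick way to check a scene file against the field list on this page is to print the shape and dtype of every stored array. The sketch below uses NumPy's `np.load`; the helper name `summarize_scene` is hypothetical. How nested fields are keyed inside the archive (for example as `segm/boxes`) is not stated here, so the code simply reports whatever keys it finds.

```python
import numpy as np


def summarize_scene(npz_path):
    """Return {key: (shape, dtype)} for every array stored in a scene NPZ."""
    summary = {}
    # allow_pickle=True is an assumption: a field such as mesh_ids (List[str])
    # may be stored as an object array. Only enable it for files you trust.
    with np.load(npz_path, allow_pickle=True) as data:
        for key in data.files:
            arr = data[key]
            summary[key] = (arr.shape, str(arr.dtype))
    return summary
```

Calling `summarize_scene("v2/scenes/{scene_id}.npz")` with a real scene id should list entries such as `rgb` with shape `(3, H, W)` and dtype `float32`.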



v2/scenes/{scene_id}.npz:

  |-- rgb:          The rendered RGB image. 

                     Shape (3, H, W), dtype float32, values scaled to [0, 1].


  |-- intrinsic:    The camera intrinsics. 

                     Shape (3, 3), dtype float32.


  |-- depth_map:    The rendered depth map corresponding to `rgb`. 

                     Shape (H, W), dtype float32.


  |-- normal_map:   The rendered normal map corresponding to `rgb`. 

                     Shape (3, H, W), dtype float32.

  |-- near_plane:   The minimum depth value of the scene's working volume.

                     Scalar, float32.


  |-- far_plane:    The maximum depth value of the scene's working volume.

                     Scalar, float32.


  |-- segm/

          |-- boxes:        2D bounding boxes for each object in the scene.

                             Shape (N_objects, 4), dtype float32.

                             These are pixel coordinates relative to `rgb`.

                             The box format is `[x_low, y_low, x_high, y_high]`.


          |-- masks:        Binary masks for each object in the scene. 

                             Shape (N_objects, H, W), dtype bool.


          |-- amodal_masks: Amodal instance masks for each object.

                             Shape (N_objects, H, W), dtype bool.


  |-- bbox3d/

          |-- poses:        The pose of each object's 3D bounding box, as a 4x4 matrix.

                             This is the transform from the bbox frame to the camera frame.

                             Shape (N_objects, 4, 4), dtype float32.


          |-- dimensions:   The dimensions of each 3D bounding box.

                             Shape (N_objects, 3), dtype float32.


          |-- corners:      The corner points of each 3D bounding box, in the camera frame.

                             Shape (N_objects, 8, 3), dtype float32.



  |-- mesh_ids:     The mesh_id of each object.

                     These correspond to the STL files in `v2/meshes/{mesh_id}.stl`.

                     List[str], length N_objects.


  |-- obj_poses/

          |-- poses:        The pose of each mesh, as a 4x4 matrix.

                             This is the transform from the mesh frame to the camera frame.

                             Note that the mesh frame does not necessarily equal the bbox frame!

                             Shape (N_objects, 4, 4), dtype float32.


          |-- scales:       The scale of each mesh.

                             Shape (N_objects, 3), dtype float32.


  |-- voxel_grid/

          |-- voxels:       The surface of each mesh, extracted into a voxel grid.

                             Shape (N_objects, n_voxels, n_voxels, n_voxels), dtype bool.


          |-- extents:      The extents of each object's voxel grid.

                             `voxel_grid/voxels[i]` spans the cuboid `[-extents[i], extents[i]]`,

                             in the object frame `obj_poses/poses[i]`.

                             Shape (N_objects, 3), dtype float32.
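

The geometric fields above can be combined to get 3D points in the camera frame. The sketch below shows two such conversions under stated assumptions that this page does not confirm: `intrinsic` is a standard pinhole matrix `[[fx, 0, cx], [0, fy, cy], [0, 0, 1]]`, `depth_map` stores the z-coordinate rather than the ray length, the three voxel axes map to the object's x, y, z axes in order, and voxel centers are evenly spaced across `[-extents, extents]`. The function names are hypothetical, and `obj_poses/scales` is not used here.

```python
import numpy as np


def backproject_depth(depth_map, intrinsic):
    """Turn an (H, W) depth map into (H, W, 3) camera-frame points.

    Assumes a pinhole camera and that depth_map holds z-depth.
    """
    h, w = depth_map.shape
    fx, fy = intrinsic[0, 0], intrinsic[1, 1]
    cx, cy = intrinsic[0, 2], intrinsic[1, 2]
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth_map / fx
    y = (v - cy) * depth_map / fy
    return np.stack([x, y, depth_map], axis=-1)


def voxels_to_camera_points(voxels, extents, pose):
    """Return (M, 3) camera-frame centers of the occupied voxels of one object.

    voxels:  (n, n, n) bool grid for object i
    extents: (3,) half-sizes of the cuboid spanned by the grid
    pose:    (4, 4) object-to-camera transform for the same object
    """
    n = np.array(voxels.shape, dtype=np.float32)
    idx = np.argwhere(voxels).astype(np.float32)  # (M, 3) voxel indices
    # Map index k to the center of cell k inside [-extent, extent] (assumed layout).
    centers = ((idx + 0.5) / n * 2.0 - 1.0) * extents
    # Apply the 4x4 transform: rotate, then translate.
    return centers @ pose[:3, :3].T + pose[:3, 3]
```

For object `i`, the inputs to `voxels_to_camera_points` would be `voxel_grid/voxels[i]`, `voxel_grid/extents[i]`, and `obj_poses/poses[i]`. Comparing the result with the back-projected depth points inside `segm/masks[i]` is one way to check whether the assumed axis order and depth convention hold for your copy of the data.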