Objects across categories are normalized to the same size scale, so the CodeLlama inputs are normalized as well. See below for mesh visualizations of a Laptop object and a Box object.
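As an illustrative sketch of this normalization (the helper `normalize_vertices` is our own naming here, not the paper's exact code), one common convention is to center each mesh at its bounding-box center and scale its longest bounding-box edge to unit length:

```python
import numpy as np

def normalize_vertices(vertices: np.ndarray) -> np.ndarray:
    """Center a mesh at the origin and scale its longest bounding-box
    edge to unit length, so objects across categories share one scale."""
    mins = vertices.min(axis=0)
    maxs = vertices.max(axis=0)
    center = (mins + maxs) / 2.0
    extent = (maxs - mins).max()
    return (vertices - center) / extent

# Example: the bounding corners of a 2 x 1 x 0.5 box.
box = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.5]])
norm = normalize_vertices(box)
print(norm.max(axis=0) - norm.min(axis=0))  # longest edge is now 1
```

Applying the same transform to the coordinates serialized for CodeLlama keeps the language-model inputs on a consistent numeric range regardless of the original object size.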
To increase the robustness of our part segmentation module, a promising future direction is to leverage the recent SAMv2 model (released after our submission), which performs object tracking and segmentation from video input. We can therefore concatenate multi-view renders into a video, run our fine-tuned model to propose kinematically accurate object parts on an initial view, and then run SAMv2 to propagate those masks across views, yielding view-consistent segmentations. We provide a video illustration of this process below:
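The multi-view-to-video step can be sketched as follows. This is a minimal illustration, not our implementation: `views_to_video` is a hypothetical helper, and the SAMv2 propagation step is indicated only by a comment, since wiring up its video predictor depends on the released checkpoint and API.

```python
import numpy as np

def views_to_video(views: list) -> np.ndarray:
    """Stack per-view renders (each H x W x 3, uint8) along a new time
    axis, producing a (T, H, W, 3) array that a video tracker can consume."""
    assert all(v.shape == views[0].shape for v in views), "views must share a shape"
    return np.stack(views, axis=0)

# Toy example: four 64 x 64 renders of the same object from different views.
views = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(4)]
video = views_to_video(views)
print(video.shape)  # (4, 64, 64, 3)

# Part masks proposed on the first frame by the fine-tuned model would then
# be passed as prompts to SAMv2's video predictor, which propagates them
# through the remaining frames, i.e. across views.
```

Because every frame of the synthetic "video" depicts the same static object, SAMv2's temporal consistency directly translates into multi-view consistency of the part masks.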