Achieving precise, scalable part-level segmentation in 3D vision and robotics is challenging, especially for fine-grained object details and language-guided tasks. We introduce a zero-shot, multi-granularity semantic labeling pipeline designed for text-prompted tasks and large-scale dataset annotation. Leveraging superpoints and 2D foundation models, our method generates detailed, non-exclusive semantic annotations for intricate object parts, bridging interactive segmentation and high-level 3D language tasks.
Our approach produces a large-scale dataset from Objaverse, combining fine-grained detail with extensive semantic coverage, advancing 3D perception and enabling new benchmarks for robotics and vision applications.
To achieve greater generality in dataset creation, we aim to generate data across multiple levels of granularity, which requires extensive annotations at varied levels of label detail. To address this challenge, we developed a novel pipeline capable of generating pseudo-labels at controllable granularity levels, as illustrated in the figure below. The pipeline produces multi-granular semantic labels alongside accurate segmentations at the superpoint level, where superpoints are subsequently processed and grouped using 2D foundation models; a simplified sketch of this grouping step follows.
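The snippet below is a minimal sketch of how superpoints could be grouped from 2D foundation-model outputs, assuming per-view label masks and the projected pixel coordinates of each superpoint are already available. All names here (`vote_superpoint_labels`, `view_masks`, the coverage threshold) are illustrative placeholders rather than the exact implementation.

```python
# Sketch: assign a pseudo-label to each superpoint by voting across rendered views,
# then group superpoints that share the same label. Assumes 2D masks and projected
# superpoint pixels are precomputed; all structures here are hypothetical.
from collections import Counter, defaultdict
import numpy as np

def vote_superpoint_labels(superpoint_pixels, view_masks, min_coverage=0.5):
    """superpoint_pixels: {view_id: {sp_id: (N, 2) int array of (row, col) pixels}}
    view_masks: {view_id: list of (label_str, HxW bool mask)} from a 2D foundation model.
    Returns {sp_id: label} for superpoints with a clear majority vote."""
    votes = defaultdict(Counter)
    for view_id, sp_pixels in superpoint_pixels.items():
        for sp_id, pix in sp_pixels.items():
            if len(pix) == 0:
                continue  # superpoint not visible in this view
            for label, mask in view_masks[view_id]:
                coverage = mask[pix[:, 0], pix[:, 1]].mean()
                if coverage >= min_coverage:
                    votes[sp_id][label] += 1  # this view votes for `label`
    labels = {}
    for sp_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        # keep only superpoints where a single label dominates the votes
        if count > sum(counter.values()) / 2:
            labels[sp_id] = label
    return labels

def group_by_label(labels):
    """Group superpoint ids that received the same pseudo-label."""
    groups = defaultdict(list)
    for sp_id, label in labels.items():
        groups[label].append(sp_id)
    return dict(groups)
```

Voting across multiple views suppresses labels supported by only a single, potentially ambiguous viewpoint, which keeps the resulting groups conservative.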
A key focus of our method is ensuring the accuracy and reliability of annotations, even when this means accepting that some correct labels are discarded during processing. By prioritizing precise and consistent semantic labeling, our pipeline enables the creation of high-quality datasets designed to train 3D foundation models effectively, capturing critical details essential for downstream tasks.
Our primary objective is to generate accurately annotated segmentations. To evaluate segmentation quality, we randomly selected 190 simple objects (fewer than 20,000 faces and of limited geometric complexity) from the Objaverse dataset. To assess annotation accuracy automatically, we leverage CLIP to compute the similarity between text embeddings and image embeddings: CLIP verifies whether a segmented part corresponds to its assigned semantic label by producing probabilities over all candidate labels.
To refine the evaluation, we render only the segmented parts from multiple views, which mitigates ambiguities caused by challenging perspectives. When parts remain difficult to distinguish, we assign the semantic label with the highest probability score, ensuring precise and reliable annotations; a sketch of this verification step is given below.
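As a hedged illustration of the CLIP-based check described above, the snippet below scores multi-view renders of a single part against all candidate labels and keeps the label with the highest mean probability. The specific checkpoint, prompt template, and input names are assumptions, not fixed by our pipeline.

```python
# Sketch: verify a part's label by comparing multi-view renders against all
# candidate labels with CLIP and averaging probabilities over views.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def verify_part_label(part_views, candidate_labels):
    """part_views: list of PIL.Image renders of one segmented part (different views).
    candidate_labels: list of label strings proposed by the pipeline.
    Returns (best_label, per-label mean probability across views)."""
    prompts = [f"a photo of a {label}" for label in candidate_labels]
    inputs = processor(text=prompts, images=part_views,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: (num_views, num_labels) image-text similarity scores
    probs = outputs.logits_per_image.softmax(dim=-1)
    mean_probs = probs.mean(dim=0)  # average over views to reduce view ambiguity
    best = int(mean_probs.argmax())
    return candidate_labels[best], mean_probs
```

Averaging over views before taking the argmax makes the check less sensitive to a single misleading render.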
Using our data engine, we have generated a dataset from more than 7,000 objects in Objaverse. This dataset can be leveraged to train 3D foundation models for tasks such as 3D object recognition, shape completion, and 3D LLMs.
Additionally, it can be re-rendered into 2D images for training 2D models, enhancing their capabilities in tasks such as part-level segmentation, object detection, and fine-grained classification. While our annotations originate from 2D models, we utilize different foundation models (GPT-4o-mini and Grounding DINO), allowing for robust cross-validation. Furthermore, our process employs various filters and restrictions, ensuring that the 2D annotations are cross-verified with 3D information for consistency and accuracy; a simplified sketch of such a filter is shown below.
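The sketch below shows one way such a cross-verification filter could work: a part label is kept only if the open-vocabulary detector localizes a matching phrase that overlaps the part's projected 2D box, and the phrase also appears in the label vocabulary proposed by the language model. The `box_iou` helper, the detection format, and the IoU threshold are illustrative assumptions rather than the exact filters used.

```python
# Sketch: cross-verify a 2D part label against detector output and an
# LLM-proposed vocabulary. Inputs are assumed to be precomputed elsewhere.

def box_iou(a, b):
    """a, b: (x0, y0, x1, y1) boxes in pixel coordinates."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def cross_verify(part_box, detections, llm_labels, iou_thresh=0.5):
    """part_box: 2D box of the projected 3D part in a rendered view.
    detections: list of (phrase, box) pairs from the open-vocabulary detector.
    llm_labels: set of part names proposed by the language model.
    Keep a label only if both models agree and the geometry is consistent."""
    kept = []
    for phrase, box in detections:
        if phrase in llm_labels and box_iou(part_box, box) >= iou_thresh:
            kept.append(phrase)
    return kept
```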