Bingchen Gong, Diego Gomez, Abdullah Hamdi, Abdelrahman Eldesokey, Ahmed Abdelreheem, Peter Wonka and Maks Ovsjanikov
[Paper] [Code]
TL;DR: We discovered that ChatGPT struggles at detecting precise points, and we introduce a new zero-shot method for detecting keypoints on 3D shapes.
Point-level reasoning on visual data is challenging as it requires precise localization capability, posing problems even for powerful models like DINO or CLIP. Traditional 3D keypoint detection methods depend heavily on annotated 3D datasets and extensive supervised training. This dependence limits their scalability and makes it hard to apply them to new categories or domains.
Our approach is different. We leverage the rich knowledge found in Multi-Modal Large Language Models (MLLMs). For the first time, we demonstrate that the pixel-level annotations used to train recent MLLMs can be harnessed to both extract and name important keypoints on 3D models, without any ground truth labels or supervision.
Our experiments show that our method performs competitively on standard benchmarks when compared to supervised methods, even though it requires no 3D keypoint annotations during training. These results highlight the potential of using language models for localized understanding of 3D shapes. Our work opens up new opportunities for cross-modal learning and underscores how effective MLLMs can be in tackling challenges in 3D computer vision.
We investigate MLLMs endowed with point-level reasoning in the context of 3D shape understanding, specifically for zero-shot keypoint detection. Given a 3D model from an arbitrary category, our main task is to localize and name salient keypoints on this model. In this paper, we propose a comprehensive zero-shot 3D keypoint detection method that exploits the point-level reasoning capabilities of recent MLLMs while enforcing 3D consistency and remaining completely category-agnostic. To the best of our knowledge, ours is the first robust method for zero-shot 3D keypoint detection.
We consider the zero-shot keypoint detection problem for 3D shapes. Specifically, given a 3D shape, we aim to automatically generate a set of salient points corresponding to the shape's semantic parts. Our solution comprises three main components: first, we prompt an MLLM with renderings of the shape, asking the model to generate a list of names for candidate keypoints. Then, for each candidate, we ask the model to detect the precise pixel coordinates of the point in a given image. Finally, we back-project the detected points into 3D and aggregate them across views to obtain the 3D keypoint locations on the shape.
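The geometric half of this pipeline (back-projection and multi-view aggregation) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the MLLM naming/detection queries are stubbed out, a pinhole camera with known depth is assumed, and function names such as `backproject` and `aggregate_keypoints` are our own for illustration.

```python
import numpy as np

def backproject(px, py, depth, K, cam_to_world):
    """Lift a 2D pixel with known depth into world coordinates
    under a standard pinhole camera model.
    K: 3x3 intrinsics; cam_to_world: 4x4 camera-to-world transform."""
    x = (px - K[0, 2]) * depth / K[0, 0]
    y = (py - K[1, 2]) * depth / K[1, 1]
    p_cam = np.array([x, y, depth, 1.0])  # homogeneous camera-space point
    return (cam_to_world @ p_cam)[:3]

def aggregate_keypoints(detections):
    """Aggregate per-view 3D candidates into one point per keypoint name.
    detections: dict mapping a keypoint name (from the MLLM) to a list of
    back-projected 3D points detected in different rendered views.
    The median gives a simple, outlier-robust consensus location."""
    return {name: np.median(np.stack(pts), axis=0)
            for name, pts in detections.items()}

# Toy usage: one camera at the origin looking down +z.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0,   0.0,  1.0]])
p = backproject(50.0, 50.0, 2.0, K, np.eye(4))  # principal point -> optical axis
kps = aggregate_keypoints({"tip": [p, p + [0, 0, 0.2], p - [0, 0, 0.2]]})
```

In practice, the per-view detections come from the MLLM's point-level answers, and views where a keypoint is occluded or not detected simply contribute no candidate to the aggregation.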
These renderings are from the ablation study, where the MLLM backbone is replaced with GPT-4o. As an MLLM trained solely on image-level tasks, it struggles with point-level reasoning, leading to less accurate and consistent keypoints compared to Molmo. This comparison highlights the necessity of specialized training or architectural features that support fine-grained reasoning in visual contexts.
[pdf] (10MB)
If you find our work helpful, please consider citing:
@InProceedings{gong2024zerokey,
title={ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models},
author={Gong, Bingchen and Gomez, Diego and Hamdi, Abdullah and Eldesokey, Abdelrahman and Abdelreheem, Ahmed and Wonka, Peter and Ovsjanikov, Maks},
year={2024}
}