ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, Hao Dong
Peking University
[code]
Abstract
We introduce an innovative approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs) to enhance the stability and generalization of manipulation. By fine-tuning only the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLM while equipping it with the ability to manipulate. The fundamental insight lies in the introduced fine-tuning paradigm, encompassing object category understanding, affordance prior reasoning, and object-centric pose prediction to stimulate the reasoning ability of the MLLM in manipulation. During inference, our approach utilizes an RGB image and a text prompt to predict the end-effector's pose in a chain-of-thought manner. After the initial contact is established, an active impedance adaptation policy is introduced to plan the subsequent waypoints in a closed loop. Moreover, in the real world, we design a test-time adaptation (TTA) strategy for manipulation that enables the model to better adapt to the current real-world scene configuration. Experiments in both simulation and the real world show the promising performance of ManipLLM.
Video Presentation
Real-world Manipulation
Overview
Figure 1. The prediction of ManipLLM. Given the text prompt, RGB image, and depth map as inputs, we obtain the 3D contact point (x, y, z). Here, x and y represent the pixel coordinates in the image predicted by ManipLLM, while z corresponds to the depth obtained from the depth camera. Additionally, ManipLLM predicts the gripper's up direction (xu, yu, zu) and forward direction (xf, yf, zf), forming the end-effector's SO(3) rotation.
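The pose assembly described in Figure 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `build_pose`, the input names, and the axis ordering in the rotation matrix are our own assumptions. It back-projects the predicted pixel (x, y) with the depth z through the camera intrinsics to get the 3D contact point, then orthonormalizes the predicted up and forward directions into an SO(3) rotation.

```python
import numpy as np

def build_pose(px, py, depth_map, K, up_dir, forward_dir):
    """Assemble a 6-DoF end-effector pose from ManipLLM-style outputs.

    px, py               : contact pixel predicted by the model (hypothetical inputs)
    depth_map            : depth image aligned with the RGB image, in meters
    K                    : 3x3 camera intrinsics matrix
    up_dir, forward_dir  : predicted gripper axes (xu, yu, zu) and (xf, yf, zf)
    """
    z = depth_map[py, px]                  # depth z read from the depth camera
    # Back-project the pixel into a 3D camera-frame contact point
    x = (px - K[0, 2]) * z / K[0, 0]
    y = (py - K[1, 2]) * z / K[1, 1]
    contact = np.array([x, y, z])

    # Orthonormalize the two predicted axes into a proper rotation
    f = np.asarray(forward_dir, dtype=float)
    f /= np.linalg.norm(f)
    u = np.asarray(up_dir, dtype=float)
    u = u - u.dot(f) * f                   # remove the component along forward
    u /= np.linalg.norm(u)
    side = np.cross(u, f)                  # third axis completes the frame
    R = np.column_stack([f, side, u])      # columns: forward, side, up
    return contact, R
```

Since the model outputs only two (possibly non-orthogonal) directions, the Gram–Schmidt step above is one simple way to recover a valid rotation; the resulting matrix has determinant +1 by construction.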
System Pipeline
Figure 2. Training details of ManipLLM. This paradigm contains four training tasks, enabling the model to recognize the current object (category-level), understand which regions can be manipulated (region-level), and finally generate a precise end-effector pose (pose-level).
Figure 3. The chain-of-thought inference process of ManipLLM.