ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, Hao Dong

Peking University

 [code]

Abstract

We introduce an innovative approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs) to enhance the stability and generalization of manipulation. By fine-tuning the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLM while equipping it with the ability to manipulate. The fundamental insight lies in the introduced fine-tuning paradigm, encompassing object category understanding, affordance prior reasoning, and object-centric pose prediction to stimulate the reasoning ability of the MLLM in manipulation. During inference, our approach uses an RGB image and a text prompt to predict the end-effector's pose in a chain-of-thought manner. After the initial contact is established, an active impedance adaptation policy is introduced to plan the upcoming waypoints in a closed-loop manner. Moreover, in the real world, we design a test-time adaptation (TTA) strategy for manipulation that enables the model to better adapt to the current real-world scene configuration. Experiments in simulation and the real world show the promising performance of ManipLLM.
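As a minimal sketch of the closed-loop waypoint planning mentioned above, the snippet below probes small perturbations around the current moving direction after contact and keeps the one whose trial motion moves the target part the most. The function name, the `evaluate` callback, and the random sampling scheme are illustrative assumptions, not the released implementation.

```python
import numpy as np

def plan_next_waypoint(current_dir, evaluate, num_candidates=8, noise=0.3):
    """Schematic closed-loop direction update after initial contact.

    `evaluate(direction)` is a hypothetical callback that executes a small
    trial motion along `direction` and returns how far the manipulated part
    moved; larger is better.
    """
    best_dir, best_score = current_dir, evaluate(current_dir)
    for _ in range(num_candidates):
        cand = current_dir + noise * np.random.randn(3)  # small perturbation
        cand = cand / np.linalg.norm(cand)               # keep it a unit vector
        score = evaluate(cand)                           # observed part displacement
        if score > best_score:
            best_dir, best_score = cand, score
    return best_dir  # direction used for the next waypoint
```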

Video Presentation

[video]

Real-world Manipulation

[video]

Overview

Figure 1. The prediction of ManipLLM. Given the text prompt, RGB image, and depth map as inputs, we obtain the 3D contact point (x, y, z). Here, x and y represent the pixel coordinates in the image predicted by ManipLLM, while z corresponds to the depth obtained from the depth camera. Additionally, ManipLLM predicts the gripper's up direction (xu, yu, zu) and forward direction (xf, yf, zf), forming the end-effector SO(3) rotation.
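The caption above fully specifies how the model outputs become an end-effector pose. The sketch below shows one way to implement that conversion, assuming known camera intrinsics (fx, fy, cx, cy); the variable names and the column ordering of the rotation matrix are assumptions for illustration, not the paper's exact convention.

```python
import numpy as np

def pixel_to_point(x, y, depth, fx, fy, cx, cy):
    """Back-project the predicted pixel (x, y) into a 3D contact point in the
    camera frame, taking z from the depth map at that pixel."""
    z = depth[int(y), int(x)]
    return np.array([(x - cx) * z / fx, (y - cy) * z / fy, z])

def directions_to_rotation(up, forward):
    """Build an SO(3) rotation from the predicted up and forward directions
    by orthonormalizing them and completing a right-handed frame."""
    f = forward / np.linalg.norm(forward)
    u = up - np.dot(up, f) * f           # make 'up' orthogonal to 'forward'
    u = u / np.linalg.norm(u)
    s = np.cross(f, u)                   # third axis completes the frame
    return np.stack([s, u, f], axis=1)   # column layout is an assumption
```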

System Pipeline

Figure 2. Training details of ManipLLM. This paradigm contains four training tasks, enabling the model to recognize the current object (category-level), understand which regions can be manipulated (region-level), and finally generate a precise end-effector pose (pose-level). 
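To make the training paradigm concrete, the snippet below sketches illustrative prompt templates for the category-, region-, and pose-level supervision listed in the caption. The wording and the `<image>` placeholder format are assumptions for illustration; they are not the exact prompts or data format used by ManipLLM.

```python
# Illustrative (not verbatim) prompt templates for the multi-task fine-tuning.
PROMPT_TEMPLATES = {
    # Category-level: keep the MLLM's recognition ability on the current object.
    "category": "What is the category of the object in the image?",
    # Region-level: reason about which parts of the object afford manipulation.
    "affordance": "Which regions of the object can be manipulated by the gripper?",
    # Pose-level: predict the contact pixel and gripper directions.
    "pose": ("Specify the contact point (x, y) in pixel coordinates, the gripper's "
             "up direction (xu, yu, zu), and forward direction (xf, yf, zf)."),
}

def build_sample(task, image_token="<image>"):
    """Pair an image placeholder with a task prompt (input format is an
    assumption, not the released data format)."""
    return f"{image_token} {PROMPT_TEMPLATES[task]}"
```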

Figure 3. The chain-of-thought inference process of ManipLLM. 
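The chain-of-thought inference can be pictured as a sequence of queries to the model (category, then manipulable region, then the final pose) followed by parsing of the pose string. In the sketch below, `mllm.generate` and the answer format are hypothetical stand-ins for the actual model interface.

```python
import re

def infer_pose(mllm, rgb_image):
    """Schematic chain-of-thought querying; `mllm.generate(image, prompt)` is a
    hypothetical interface returning a text answer."""
    category = mllm.generate(rgb_image, "What is the category of the object?")
    region = mllm.generate(
        rgb_image, f"The object is a {category}. Which region can be manipulated?")
    answer = mllm.generate(
        rgb_image,
        f"Considering the {region}, specify the contact point (x, y), the up "
        "direction (xu, yu, zu), and the forward direction (xf, yf, zf).")
    # Parse all numbers from the answer; the output format is an assumption.
    nums = [float(v) for v in re.findall(r"-?\d+\.?\d*", answer)]
    (x, y), up, forward = nums[:2], nums[2:5], nums[5:8]
    return (x, y), up, forward
```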

Experiments

Manipulation results across object categories: Cabinet, Fridge, Jar, Pot, Trashcan, Microwave, Armchair, and Toilet.