Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation
Authors: Jiaming Liu 1, Chenxuan Li 1🍩, Guanqun Wang 1🍩, Lily Lee 1, Kaichen Zhou 1,
Sixiang Chen 1, Chuyan Xiong, Jiaxin Ge 1, Renrui Zhang, Shanghang Zhang 1🍭
🍩: Equal technical contribution; 🍭: Corresponding author
Affiliation: 1) National Key Laboratory for Multimedia Information Processing,
School of Computer Science, Peking University.
Main contributions:
To unleash general MLLMs as end-to-end robotic agents, we introduce the Self-Corrected (SC)-MLLM, which equips the model not only to predict end-effector poses but also to autonomously recognize and correct failed actions.
Our SC-MLLM makes the first attempt to detect the cause of failures in low-level pose prediction. Based on the identified cause, SC-MLLM adaptively requests prompt feedback from experts to rethink the current failure scene and generate a corrected action.
We design a continuous policy learning method for corrected samples, enhancing the model's adaptability to scene configurations and reducing the frequency of expert intervention (the full loop is sketched after this list).
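A minimal sketch of how these three pieces could compose into a single control loop. All names here (predict_pose, detect_failure, correct_pose, and the duck-typed model, env, and expert objects) are hypothetical placeholders for illustration, not the released SC-MLLM interface.

```python
# Hypothetical sketch of the SC-MLLM closed loop; every function name
# below is an illustrative placeholder, not the paper's released API.
def run_episode(model, env, expert, instruction, max_corrections=3):
    """Predict an end-effector pose, execute it, and correct on failure."""
    start_img = env.reset()                               # initial-state image
    pose = model.predict_pose(start_img, instruction)     # Step 1: pose as language modeling
    for _ in range(max_corrections):
        end_img, success = env.execute(pose)              # roll out the predicted action
        if success:
            # Step 3: keep the successful sample for continual policy learning
            model.store_corrected_sample(start_img, instruction, pose)
            return True
        cause = model.detect_failure(end_img, pose)       # Step 2: classify the failure cause
        prompt = expert.feedback(cause, end_img)          # cause-specific expert prompt
        pose = model.correct_pose(end_img, prompt)        # regenerate the action pose
    return False
```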
Closed-loop correction:
Step 1: SC-MLLM reframes pose prediction as a language-modeling problem, using the initial-state image and a text prompt to generate the action pose (see the tokenization sketch after these steps).
Step 2: SC-MLLM exploits the end-state image and end-effector parameters for failure recognition, and adaptively requests prompt feedback from experts to generate corrected poses (see the feedback-prompt sketch below).
Step 3: SC-MLLM continuously learns policies from successfully corrected samples, enhancing the model's adaptability to the current scene configuration. Through these correction steps, we can efficiently provide a customized policy for each user rather than relying on a shared, low-accuracy policy (see the continual-learning sketch below).
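Casting pose prediction as language modeling (Step 1) typically means discretizing each continuous pose dimension into bins whose indices become tokens in the MLLM's vocabulary. The bin count, value ranges, and 7-DoF layout below are assumptions for illustration, not SC-MLLM's exact scheme.

```python
import numpy as np

# Assumed discretization scheme for pose-as-language-modeling;
# 256 bins per dimension is a common choice, not the paper's exact one.
N_BINS = 256

def pose_to_tokens(pose, low, high):
    """Quantize each continuous pose dimension into one of N_BINS bins."""
    norm = (np.asarray(pose, dtype=np.float64) - low) / (high - low)  # scale to [0, 1]
    bins = np.clip((norm * (N_BINS - 1)).round(), 0, N_BINS - 1)
    return bins.astype(int).tolist()          # one token id per pose dimension

def tokens_to_pose(tokens, low, high):
    """Invert the quantization back to a continuous pose."""
    norm = np.asarray(tokens, dtype=np.float64) / (N_BINS - 1)
    return norm * (high - low) + low

# Example 7-DoF pose: xyz position, roll-pitch-yaw rotation, gripper state.
low  = np.array([-1.0, -1.0, 0.0, -np.pi, -np.pi, -np.pi, 0.0])
high = np.array([ 1.0,  1.0, 1.0,  np.pi,  np.pi,  np.pi, 1.0])
pose = np.array([0.12, -0.30, 0.45, 0.0, 1.57, -0.5, 1.0])
tokens = pose_to_tokens(pose, low, high)
print(tokens, tokens_to_pose(tokens, low, high))
```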
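One way to realize the cause-conditioned expert feedback of Step 2 is a mapping from detected failure causes to prompt templates, filled with expert outputs such as a detected bounding box or an estimated axis. The cause taxonomy and wording below are hypothetical, not the paper's exact prompts.

```python
# Hypothetical cause-to-prompt mapping; causes and templates are
# illustrative, not SC-MLLM's exact feedback design.
FEEDBACK_PROMPTS = {
    "wrong_position": "The gripper missed the target part. An expert detector marks it at {bbox}; regenerate the grasp position.",
    "wrong_rotation": "The gripper orientation did not match the object surface. The estimated surface normal is {normal}; regenerate the rotation.",
    "wrong_direction": "The interaction direction was incorrect. The estimated articulation axis is {axis}; regenerate the movement direction.",
}

def build_correction_prompt(cause, **expert_info):
    """Compose the text prompt fed back to the MLLM after a failure."""
    return FEEDBACK_PROMPTS[cause].format(**expert_info)

print(build_correction_prompt("wrong_position", bbox=(112, 84, 160, 140)))
```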
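Step 3's continuous policy learning could be implemented as incremental fine-tuning on successfully corrected samples, mixed with a small replay buffer to limit forgetting. This is a standard continual-learning recipe assumed here for illustration, not necessarily SC-MLLM's exact training procedure.

```python
import random
import torch

# Assumed continual-learning setup: gradient steps on newly corrected
# samples mixed with replayed past corrections to reduce forgetting.
class CorrectedSampleLearner:
    def __init__(self, model, lr=1e-5, replay_capacity=1000, replay_ratio=0.5):
        self.model = model                    # any torch.nn.Module policy head
        self.opt = torch.optim.AdamW(model.parameters(), lr=lr)
        self.replay = []                      # past (inputs, target_pose) pairs
        self.capacity = replay_capacity
        self.replay_ratio = replay_ratio

    def update(self, corrected_batch):
        """One gradient step on corrected samples plus replayed ones."""
        k = int(len(corrected_batch) * self.replay_ratio)
        batch = corrected_batch + random.sample(self.replay, min(k, len(self.replay)))
        loss = 0.0
        for inputs, target_pose in batch:
            pred = self.model(inputs)
            loss = loss + torch.nn.functional.mse_loss(pred, target_pose)
        loss = loss / len(batch)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # keep a bounded buffer of past corrections for future replay
        self.replay.extend(corrected_batch)
        self.replay = self.replay[-self.capacity:]
        return loss.item()
```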
Figure: Failure case example
Figure: Successful correction example