Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation
Authors: Jiaming Liu 1, Chenxuan Li 1🍩, Guanqun Wang 1🍩, Lily Lee 1, Kaichen Zhou 1,
Sixiang Chen 1, Chuyan Xiong, Jiaxin Ge 1, Renrui Zhang, Shanghang Zhang 1🍭
🍩: Equal technical contribution; 🍭: Corresponding author
Affiliation: 1) National Key Laboratory for Multimedia Information Processing,
School of Computer Science, Peking University.
Main contributions:
To unleash general MLLMs as end-to-end robotic agents, we introduce the Self-Corrected (SC)-MLLM, which equips the model not only to predict end-effector poses but also to autonomously recognize and correct failed actions.
Our SC-MLLM makes the first attempt to detect the cause of failures in low-level pose prediction. Based on the identified cause, SC-MLLM adaptively requests prompt feedback from experts to rethink the current failure scene and generate a corrected action.
We design a continuous policy learning method for corrected samples, enhancing the model's adaptability to scene configurations and reducing the frequency of expert intervention (the full loop is sketched after this list).
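A minimal sketch of how these three pieces could compose into a single control loop. All names here (predict_pose, detect_failure, correct_pose, and the duck-typed model, env, and expert objects) are hypothetical placeholders for illustration, not the released SC-MLLM interface.

```python
# Hypothetical sketch of the SC-MLLM closed loop; every function name
# below is an illustrative placeholder, not the paper's released API.
def run_episode(model, env, expert, instruction, max_corrections=3):
    """Predict an end-effector pose, execute it, and correct on failure."""
    start_img = env.reset()                               # initial-state image
    pose = model.predict_pose(start_img, instruction)     # Step 1: pose as language modeling
    for _ in range(max_corrections):
        end_img, success = env.execute(pose)              # roll out the predicted action
        if success:
            # Step 3: keep the successful sample for continual policy learning
            model.store_corrected_sample(start_img, instruction, pose)
            return True
        cause = model.detect_failure(end_img, pose)       # Step 2: classify the failure cause
        prompt = expert.feedback(cause, end_img)          # cause-specific expert prompt
        pose = model.correct_pose(end_img, prompt)        # regenerate the action pose
    return False
```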
Closed-loop correction:
Step 1: SC-MLLM reframes pose prediction as a language-modeling problem, using the initial-state image and a text prompt to generate the action pose (see the tokenization sketch after these steps).
Step 2: SC-MLLM exploits the end-state image and end-effector parameters for failure recognition, and adaptively requests prompt feedback from experts to generate corrected poses (see the feedback-prompt sketch below).
Step 3: SC-MLLM continuously learns policies from successfully corrected samples, enhancing the model's adaptability to the current scene configuration. Through these correction steps, we can efficiently provide a customized policy for each user rather than relying on a shared, low-accuracy policy (see the continual-learning sketch below).
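Casting pose prediction as language modeling (Step 1) typically means discretizing each continuous pose dimension into bins whose indices become tokens in the MLLM's vocabulary. The bin count, value ranges, and 7-DoF layout below are assumptions for illustration, not SC-MLLM's exact scheme.

```python
import numpy as np

# Assumed discretization scheme for pose-as-language-modeling;
# 256 bins per dimension is a common choice, not the paper's exact one.
N_BINS = 256

def pose_to_tokens(pose, low, high):
    """Quantize each continuous pose dimension into one of N_BINS bins."""
    norm = (np.asarray(pose, dtype=np.float64) - low) / (high - low)  # scale to [0, 1]
    bins = np.clip((norm * (N_BINS - 1)).round(), 0, N_BINS - 1)
    return bins.astype(int).tolist()          # one token id per pose dimension

def tokens_to_pose(tokens, low, high):
    """Invert the quantization back to a continuous pose."""
    norm = np.asarray(tokens, dtype=np.float64) / (N_BINS - 1)
    return norm * (high - low) + low

# Example 7-DoF pose: xyz position, roll-pitch-yaw rotation, gripper state.
low  = np.array([-1.0, -1.0, 0.0, -np.pi, -np.pi, -np.pi, 0.0])
high = np.array([ 1.0,  1.0, 1.0,  np.pi,  np.pi,  np.pi, 1.0])
pose = np.array([0.12, -0.30, 0.45, 0.0, 1.57, -0.5, 1.0])
tokens = pose_to_tokens(pose, low, high)
print(tokens, tokens_to_pose(tokens, low, high))
```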
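One way to realize the cause-conditioned expert feedback of Step 2 is a mapping from detected failure causes to prompt templates, filled with expert outputs such as a detected bounding box or an estimated axis. The cause taxonomy and wording below are hypothetical, not the paper's exact prompts.

```python
# Hypothetical cause-to-prompt mapping; causes and templates are
# illustrative, not SC-MLLM's exact feedback design.
FEEDBACK_PROMPTS = {
    "wrong_position": "The gripper missed the target part. An expert detector marks it at {bbox}; regenerate the grasp position.",
    "wrong_rotation": "The gripper orientation did not match the object surface. The estimated surface normal is {normal}; regenerate the rotation.",
    "wrong_direction": "The interaction direction was incorrect. The estimated articulation axis is {axis}; regenerate the movement direction.",
}

def build_correction_prompt(cause, **expert_info):
    """Compose the text prompt fed back to the MLLM after a failure."""
    return FEEDBACK_PROMPTS[cause].format(**expert_info)

print(build_correction_prompt("wrong_position", bbox=(112, 84, 160, 140)))
```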
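Step 3's continuous policy learning could be implemented as incremental fine-tuning on successfully corrected samples, mixed with a small replay buffer to limit forgetting. This is a standard continual-learning recipe assumed here for illustration, not necessarily SC-MLLM's exact training procedure.

```python
import random
import torch

# Assumed continual-learning setup: gradient steps on newly corrected
# samples mixed with replayed past corrections to reduce forgetting.
class CorrectedSampleLearner:
    def __init__(self, model, lr=1e-5, replay_capacity=1000, replay_ratio=0.5):
        self.model = model                    # any torch.nn.Module policy head
        self.opt = torch.optim.AdamW(model.parameters(), lr=lr)
        self.replay = []                      # past (inputs, target_pose) pairs
        self.capacity = replay_capacity
        self.replay_ratio = replay_ratio

    def update(self, corrected_batch):
        """One gradient step on corrected samples plus replayed ones."""
        k = int(len(corrected_batch) * self.replay_ratio)
        batch = corrected_batch + random.sample(self.replay, min(k, len(self.replay)))
        loss = 0.0
        for inputs, target_pose in batch:
            pred = self.model(inputs)
            loss = loss + torch.nn.functional.mse_loss(pred, target_pose)
        loss = loss / len(batch)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # keep a bounded buffer of past corrections for future replay
        self.replay.extend(corrected_batch)
        self.replay = self.replay[-self.capacity:]
        return loss.item()
```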
Figure: Failure case example
Figure: Successful correction example