Authors: Jiaming Liu 1 2, Mengzhen Liu 1 🍩, Zhenyu Wang 1, Pengju An 1, Xiaoqi Li 1, Kaichen Zhou 1, Senqiao Yang 1, Renrui Zhang, Yandong Guo 2, Shanghang Zhang 1 3 🍭
🍩: Equal technical contribution; 🍭: Corresponding author
Affiliations: 1 - National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; 2 - AI2Robotics; 3 - Beijing Academy of Artificial Intelligence (BAAI)
Main contributions:
We innovatively integrate a vision encoder with the efficient Mamba language model to construct the end-to-end RoboMamba, which possesses visual common sense and robot-related reasoning abilities (see the architecture sketch after this list).
To equip RoboMamba with action pose prediction abilities, we explore an efficient fine-tuning strategy using a simple policy head. We find that once RoboMamba achieves sufficient reasoning capabilities, it can acquire pose prediction skills with minimal cost.
In extensive experiments, RoboMamba excels in reasoning on both general and robotic evaluation benchmarks, and delivers impressive pose prediction results in both simulation and real-world experiments.
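As context for the first contribution, below is a minimal PyTorch sketch of how a vision encoder, a projection layer, and a Mamba language model can be wired into an end-to-end vision-language model. The class names, module interfaces, and dimensions (e.g., vision_dim=1024, lm_dim=2560) are illustrative assumptions, not the released RoboMamba code.

```python
# Minimal architecture sketch (hypothetical names and dimensions).
# A pretrained vision encoder produces patch tokens, a projection layer maps
# them into the language model's embedding space, and a Mamba language model
# consumes the concatenated visual + text tokens.
import torch
import torch.nn as nn


class RoboMambaSketch(nn.Module):
    def __init__(self, vision_encoder: nn.Module, mamba_lm: nn.Module,
                 vision_dim: int = 1024, lm_dim: int = 2560):
        super().__init__()
        self.vision_encoder = vision_encoder            # e.g., a CLIP-style ViT
        self.projector = nn.Linear(vision_dim, lm_dim)  # alignment (projection) layer
        self.mamba_lm = mamba_lm                        # Mamba language model

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        # images: (B, 3, H, W) -> visual tokens (B, N, vision_dim)
        visual_tokens = self.vision_encoder(images)
        # Project visual tokens into the LM embedding space.
        visual_embeds = self.projector(visual_tokens)
        # Prepend visual tokens to the text tokens and run the LM end to end.
        inputs = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.mamba_lm(inputs)                    # (B, N + T, lm_dim) hidden states
```

In Stage 1.1 below, only the projection layer of such a model would be updated, with the vision encoder and language model kept fixed.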
Stage 1.1 Alignment pre-training: only the projection layer is updated.
Stage 1.2 Instruction co-training: combine high-level robotic data (e.g., task planning) with general instruction data.
Stage 2 Robot manipulation fine-tuning: freeze all RoboMamba parameters and introduce a simple policy head that models Mamba's output tokens. The policy head contains two MLPs that separately predict the end-effector's position and direction (a minimal sketch follows below).
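As noted above, here is a minimal PyTorch sketch of the Stage 2 setup. The MLP widths, the 3-D position / 6-D direction outputs, the token pooling, and the optimizer settings are illustrative assumptions rather than the paper's exact configuration.

```python
# Stage 2 sketch (hypothetical dimensions and design choices).
# The RoboMamba backbone is frozen; only a lightweight policy head with two
# MLPs (position and direction) is trained on top of the LM's output tokens.
import torch
import torch.nn as nn


def mlp(in_dim: int, hidden: int, out_dim: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(),
                         nn.Linear(hidden, out_dim))


class PolicyHead(nn.Module):
    def __init__(self, lm_dim: int = 2560, hidden: int = 512):
        super().__init__()
        self.position_mlp = mlp(lm_dim, hidden, 3)   # end-effector position (x, y, z)
        self.direction_mlp = mlp(lm_dim, hidden, 6)  # end-effector direction (6-D rotation, assumed)

    def forward(self, lm_tokens: torch.Tensor):
        # lm_tokens: (B, T, lm_dim); mean-pool the output tokens (pooling choice is an assumption).
        feat = lm_tokens.mean(dim=1)
        return self.position_mlp(feat), self.direction_mlp(feat)


def build_stage2_optimizer(robomamba: nn.Module, head: PolicyHead):
    # Freeze the backbone so only the policy head receives gradients.
    for p in robomamba.parameters():
        p.requires_grad = False
    return torch.optim.AdamW(head.parameters(), lr=1e-4)
```

Because the trainable policy head is tiny relative to the frozen backbone, this setup reflects the claim above that, once reasoning ability is in place, pose prediction can be learned at minimal cost.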
Figure: A successful manipulation example in the real world.