Rongyu Zhang*, Menghang Dong*, Yuan Zhang, Heng Liang, Xiaowei Chi, Gaole Dai,
Li Du, Dan Wang, Yuan Du, Shanghang Zhang
Nanjing University; The Hong Kong Polytechnic University;
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University;
The Hong Kong University of Science and Technology
Multimodal Large Language Models (MLLMs) excel at understanding complex language and visual data, enabling generalist robotic systems to interpret instructions and perform embodied tasks. Nevertheless, their real-world deployment is hindered by substantial computational and storage demands. Recent insights into homogeneous patterns across LLM layers have inspired sparsification techniques, such as early exit and token pruning, to address these challenges. However, these methods often neglect the critical role of the final layers, which encode the semantic information most relevant to downstream robotic tasks. Aligning with the recent Shallow Brain Hypothesis (SBH) in neuroscience and the mixture-of-experts paradigm in model sparsification, we conceptualize each LLM layer as an expert and propose a Mixture-of-Layers Vision-Language-Action model (MoLe-VLA, or simply MoLe) architecture for dynamic LLM layer activation. We introduce a Spatial-Temporal Aware Router (STAR) that selectively activates only a subset of layers based on the robot's current state, mimicking the brain's distinct signal pathways specialized for cognition and causal reasoning. Additionally, to compensate for the cognitive capacity of the LLM lost in MoLe, we devise cognition self-knowledge distillation (CogKD), which leverages cognition features to enhance the understanding of task demands and to generate task-relevant action sequences. Extensive experiments in both RLBench simulation and real-world environments demonstrate the superiority of MoLe-VLA in both efficiency and performance, achieving an 8\% improvement in mean success rate across ten tasks while reducing the LLM's computational cost by up to $5.6\times$.
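To make the mixture-of-layers idea concrete, the sketch below shows a minimal router-gated decoder stack in PyTorch: a lightweight router scores every layer from a pooled token summary, only the top-$k$ scored layers are executed, and skipped layers act as identities. The class and variable names (\texttt{LayerRouter}, \texttt{MixtureOfLayersDecoder}, \texttt{active\_k}) are illustrative assumptions; this is not the authors' STAR implementation, which additionally conditions routing on spatial-temporal cues from the robot's state.
\begin{lstlisting}[language=Python]
# Illustrative sketch (not the authors' code): router-gated layer selection
# over a stack of decoder layers. A pooled token summary drives routing and
# only the top-k scored layers are executed per forward pass.
import torch
import torch.nn as nn


class LayerRouter(nn.Module):
    """Scores every decoder layer from a pooled state embedding (hypothetical)."""

    def __init__(self, hidden_dim: int, num_layers: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, num_layers)

    def forward(self, state_embed: torch.Tensor) -> torch.Tensor:
        # state_embed: (batch, hidden_dim) -> one activation score per layer
        return self.scorer(state_embed)


class MixtureOfLayersDecoder(nn.Module):
    """Executes only the k highest-scored layers; skipped layers act as identity."""

    def __init__(self, layers: nn.ModuleList, hidden_dim: int, active_k: int):
        super().__init__()
        self.layers = layers
        self.router = LayerRouter(hidden_dim, len(layers))
        self.active_k = active_k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Pool tokens into a single summary vector to condition the router on.
        state_embed = tokens.mean(dim=1)                   # (B, D)
        scores = self.router(state_embed)                  # (B, L)
        topk = scores.topk(self.active_k, dim=-1).indices  # (B, k)
        # Simplification: route with the first sample's choice (batch size 1).
        active = set(topk[0].tolist())
        for idx, layer in enumerate(self.layers):
            if idx in active:
                tokens = layer(tokens)
        return tokens


# Usage: an 8-layer stack where only half the layers run per forward pass.
if __name__ == "__main__":
    dim, n_layers = 64, 8
    blocks = nn.ModuleList(
        [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
         for _ in range(n_layers)]
    )
    model = MixtureOfLayersDecoder(blocks, dim, active_k=n_layers // 2)
    out = model(torch.randn(1, 16, dim))
    print(out.shape)  # torch.Size([1, 16, 64])
\end{lstlisting}
In the full model, the hard top-$k$ selection would need a differentiable relaxation (e.g., a soft gate during training), which this sketch omits for brevity.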
We compare the performance of our proposed MoLe method with state-of-the-art VLA models across ten RLBench tasks, utilizing only half of the LLM layers for efficiency.
(Figure: per-task comparison on the RLBench tasks put rubbish in bin, toilet seat down, close box, phone on base, change clock, close laptop lid, sweep to dustpan, and take frame off hanger.)
For our real-world experiments, we use the Franka Research 3 (FR3) robotic arm as the hardware platform. Because the FR3's default gripper has relatively short fingers and struggles with certain complex tasks, we replace it with a 3D-printed UMI gripper. A GoPro 9 camera positioned to the right of the setup captures high-quality RGB images, providing visual input for the pipeline.
We conduct experiments on three tasks: detach charger, pull drawer, and pour water. For each task, keyframes are extracted to construct the training set, using \textbf{10} keyframes per task. The figure illustrates the experimental setup and assets.
(Figure: real-world experimental setup and assets for the pour water, detach charger, and pull drawer tasks.)
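Since the text only specifies that \textbf{10} keyframes are used per task, the snippet below sketches one common keyframe-selection heuristic for robot demonstrations: a frame is kept as a candidate when the gripper state toggles or the arm is nearly stationary, and candidates are then evenly subsampled. The function name \texttt{extract\_keyframes}, the velocity threshold, and the selection criterion are assumptions for illustration, not the authors' exact procedure.
\begin{lstlisting}[language=Python]
# Hypothetical keyframe-selection heuristic for building the per-task
# training set; the criterion below (gripper toggle or near-zero arm
# velocity) is an assumption, not the paper's stated procedure.
import numpy as np


def extract_keyframes(gripper_open: np.ndarray,
                      joint_velocities: np.ndarray,
                      num_keyframes: int = 10,
                      vel_eps: float = 1e-2) -> list:
    """Return indices of demonstration frames treated as keyframes.

    A frame becomes a candidate when the gripper state changes or the arm
    is nearly stationary; candidates are subsampled to `num_keyframes`.
    """
    num_frames = len(gripper_open)
    candidates = []
    for t in range(1, num_frames):
        gripper_changed = gripper_open[t] != gripper_open[t - 1]
        arm_stationary = np.linalg.norm(joint_velocities[t]) < vel_eps
        if gripper_changed or arm_stationary:
            candidates.append(t)
    if not candidates:
        candidates = list(range(num_frames))
    # Evenly subsample candidates down to the requested number of keyframes.
    picks = np.linspace(0, len(candidates) - 1,
                        num=min(num_keyframes, len(candidates)))
    return [candidates[int(i)] for i in picks]


# Example: a 200-step demo with a single grasp around step 120.
gripper = np.array([1.0] * 120 + [0.0] * 80)
vels = np.random.uniform(0.05, 0.5, size=(200, 7))
print(extract_keyframes(gripper, vels))
\end{lstlisting}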