Rongyu Zhang*, Menghang Dong*, Yuan Zhang, Heng Liang, Xiaowei Chi, Gaole Dai,
Li Du, Dan Wang, Yuan Du, Shanghang Zhang
Nanjing University; The Hong Kong Polytechnic University;
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University;
The Hong Kong University of Science and Technology
Multimodal Large Language Models (MLLMs) excel at understanding complex language and visual data, enabling generalist robotic systems to interpret instructions and perform embodied tasks. Nevertheless, their real-world deployment is hindered by substantial computational and storage demands. Recent insights into homogeneous patterns across LLM layers have inspired sparsification techniques, such as early exit and token pruning, to address these challenges. However, these methods often neglect the critical role of the final layers, which encode the semantic information most relevant to downstream robotic tasks. Aligning with the recent Shallow Brain Hypothesis (SBH) in neuroscience and the mixture-of-experts paradigm in model sparsification, we conceptualize each LLM layer as an expert and propose a Mixture-of-Layers Vision-Language-Action model (MoLe-VLA, or simply MoLe) architecture for dynamic LLM layer activation. We introduce a Spatial-Temporal Aware Router (STAR) that selectively activates only a subset of layers based on the robot's current state, mimicking the brain's distinct signal pathways specialized for cognition and causal reasoning. Additionally, to compensate for the cognitive capacity of the LLM lost in MoLe, we devise cognition self-knowledge distillation (CogKD), which leverages cognition features to enhance the understanding of task demands and to generate task-relevant action sequences. Extensive experiments in both RLBench simulation and real-world environments demonstrate the superiority of MoLe-VLA in both efficiency and performance, achieving an 8\% improvement in mean success rate across ten tasks while reducing the LLM's computational cost by up to $5.6\times$.
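To make the mixture-of-layers idea concrete, the sketch below shows a minimal router-gated decoder stack in PyTorch: a lightweight router scores every layer from a pooled token summary, only the top-$k$ scored layers are executed, and skipped layers act as identities. The class and variable names (\texttt{LayerRouter}, \texttt{MixtureOfLayersDecoder}, \texttt{active\_k}) are illustrative assumptions; this is not the authors' STAR implementation, which additionally conditions routing on spatial-temporal cues from the robot's state.
\begin{lstlisting}[language=Python]
# Illustrative sketch (not the authors' code): router-gated layer selection
# over a stack of decoder layers. A pooled token summary drives routing and
# only the top-k scored layers are executed per forward pass.
import torch
import torch.nn as nn


class LayerRouter(nn.Module):
    """Scores every decoder layer from a pooled state embedding (hypothetical)."""

    def __init__(self, hidden_dim: int, num_layers: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, num_layers)

    def forward(self, state_embed: torch.Tensor) -> torch.Tensor:
        # state_embed: (batch, hidden_dim) -> one activation score per layer
        return self.scorer(state_embed)


class MixtureOfLayersDecoder(nn.Module):
    """Executes only the k highest-scored layers; skipped layers act as identity."""

    def __init__(self, layers: nn.ModuleList, hidden_dim: int, active_k: int):
        super().__init__()
        self.layers = layers
        self.router = LayerRouter(hidden_dim, len(layers))
        self.active_k = active_k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Pool tokens into a single summary vector to condition the router on.
        state_embed = tokens.mean(dim=1)                   # (B, D)
        scores = self.router(state_embed)                  # (B, L)
        topk = scores.topk(self.active_k, dim=-1).indices  # (B, k)
        # Simplification: route with the first sample's choice (batch size 1).
        active = set(topk[0].tolist())
        for idx, layer in enumerate(self.layers):
            if idx in active:
                tokens = layer(tokens)
        return tokens


# Usage: an 8-layer stack where only half the layers run per forward pass.
if __name__ == "__main__":
    dim, n_layers = 64, 8
    blocks = nn.ModuleList(
        [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
         for _ in range(n_layers)]
    )
    model = MixtureOfLayersDecoder(blocks, dim, active_k=n_layers // 2)
    out = model(torch.randn(1, 16, dim))
    print(out.shape)  # torch.Size([1, 16, 64])
\end{lstlisting}
In the full model, the hard top-$k$ selection would need a differentiable relaxation (e.g., a soft gate during training), which this sketch omits for brevity.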
We compare the performance of our proposed MoLe method with state-of-the-art VLA models across ten RLBench tasks, utilizing only half of the LLM layers for efficiency.
(Figure: per-task comparison on the RLBench tasks put rubbish in bin, toilet seat down, close box, phone on base, change clock, close laptop lid, sweep to dustpan, and take frame off hanger.)
For our real-world experiments, we use the Franka Research 3 (FR3) robotic arm as the hardware platform. Because the FR3's default gripper has relatively short fingers and struggles with certain complex tasks, we replace it with a 3D-printed UMI gripper. A GoPro 9 camera positioned to the right of the setup captures high-quality RGB images, providing visual input for the pipeline.
We conduct experiments on three tasks: detach charger, pull drawer, and pour water. For each task, keyframes are extracted to construct the training set, using \textbf{10} keyframes per task. The figure illustrates the experimental setup and assets.
(Figure: real-world experimental setup and assets for the pour water, detach charger, and pull drawer tasks.)
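Since the text only specifies that \textbf{10} keyframes are used per task, the snippet below sketches one common keyframe-selection heuristic for robot demonstrations: a frame is kept as a candidate when the gripper state toggles or the arm is nearly stationary, and candidates are then evenly subsampled. The function name \texttt{extract\_keyframes}, the velocity threshold, and the selection criterion are assumptions for illustration, not the authors' exact procedure.
\begin{lstlisting}[language=Python]
# Hypothetical keyframe-selection heuristic for building the per-task
# training set; the criterion below (gripper toggle or near-zero arm
# velocity) is an assumption, not the paper's stated procedure.
import numpy as np


def extract_keyframes(gripper_open: np.ndarray,
                      joint_velocities: np.ndarray,
                      num_keyframes: int = 10,
                      vel_eps: float = 1e-2) -> list:
    """Return indices of demonstration frames treated as keyframes.

    A frame becomes a candidate when the gripper state changes or the arm
    is nearly stationary; candidates are subsampled to `num_keyframes`.
    """
    num_frames = len(gripper_open)
    candidates = []
    for t in range(1, num_frames):
        gripper_changed = gripper_open[t] != gripper_open[t - 1]
        arm_stationary = np.linalg.norm(joint_velocities[t]) < vel_eps
        if gripper_changed or arm_stationary:
            candidates.append(t)
    if not candidates:
        candidates = list(range(num_frames))
    # Evenly subsample candidates down to the requested number of keyframes.
    picks = np.linspace(0, len(candidates) - 1,
                        num=min(num_keyframes, len(candidates)))
    return [candidates[int(i)] for i in picks]


# Example: a 200-step demo with a single grasp around step 120.
gripper = np.array([1.0] * 120 + [0.0] * 80)
vels = np.random.uniform(0.05, 0.5, size=(200, 7))
print(extract_keyframes(gripper, vels))
\end{lstlisting}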