Zhuoyang Liu¹*, Jiaming Liu¹*, Jiadong Xu¹, Nuowei Han¹, Chenyang Gu¹, Hao Chen³, Kaichen Zhou¹,
Renrui Zhang³, Kai Chin Hsieh¹, Kun Wu², Zhengping Che², Jian Tang², Shanghang Zhang¹
¹State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, ²Beijing Innovation Center of Humanoid Robotics, ³CUHK
💡Abstract
Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and language to generate actions, whereas robots must perceive and interact within the spatial-physical world. This gap highlights the need for a comprehensive understanding of robotic-specific multisensory information, which is crucial for achieving complex and contact-rich control. To this end, we introduce a multisensory language–action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling. Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that innovatively repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence. To further enhance MLA’s understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation. For evaluation, the MLA model outperforms the previous state-of-the-art 2D and 3D VLA methods by 12% and 24% in complex, contact-rich real-world tasks, respectively, while also demonstrating improved generalization to unseen configurations. Project website: https://sites.google.com/view/open-mla
⚙️Model Architecture
We propose MLA, a multisensory language–action model that collaboratively processes diverse sensory inputs and predicts their corresponding future states to enhance physical-world modeling for robotic control. To avoid introducing additional modality-specific encoders that lack pretraining alignment with the LLM’s embeddings, MLA adopts an encoder-free multimodal alignment mechanism, repurposing the initial transformer blocks of the LLM as a perception module to directly interpret visual, geometric, and tactile cues. In particular, we project 3D points and the spatial positions of the gripper’s tactile sensors onto 2D image planes using camera parameters, thereby constructing cross-modal positional mappings. These positional correspondences serve as positive pairs for token-level contrastive learning, aligning multimodal features within the LLM’s embedding space. This position-guided consistency constraint enhances the multimodal representations of MLA and supports more comprehensive physical-world perception.
To further enhance the LLM’s understanding of physical robotic scenes, we propose a future multisensory generation post-training strategy. Specifically, lightweight transformer-based decoders and a tailored generation scheme are designed to process the LLM’s final-layer features and generate the future states of multiple modalities, including 2D images, 3D point clouds, and tactile signals. Through this predictive process, MLA is able to reason about physical dynamics along multiple dimensions, encompassing semantic information, geometric structures, and object-centric interactions.
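The following is a minimal sketch of how the position-guided, token-level contrastive alignment could be set up. All function names, tensor shapes, and the temperature value are illustrative assumptions rather than the released implementation: 3D points (and tactile sensor positions) are projected onto the image plane with camera parameters, and tokens whose projections coincide are treated as positive pairs in an InfoNCE-style loss.

```python
# Minimal sketch of position-guided token-level contrastive alignment.
# Shapes, names, and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def project_points_to_pixels(points_xyz, K, T_cam):
    """Project 3D world points (N, 3) onto the 2D image plane using
    camera extrinsics T_cam (4, 4) and intrinsics K (3, 3)."""
    homo = torch.cat([points_xyz, torch.ones_like(points_xyz[:, :1])], dim=-1)  # (N, 4)
    cam = (T_cam @ homo.T).T[:, :3]                 # points in the camera frame
    uv = (K @ cam.T).T                              # perspective projection
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)   # pixel coordinates (N, 2)

def token_level_contrastive_loss(img_tokens, pc_tokens, pos_pairs, tau=0.07):
    """InfoNCE over positionally corresponding tokens.

    img_tokens: (Ni, D) image patch features from an intermediate LLM block
    pc_tokens:  (Np, D) point-cloud (or tactile) token features from the same block
    pos_pairs:  (P, 2) index pairs (img_idx, pc_idx) whose 2D projections coincide
    """
    img = F.normalize(img_tokens, dim=-1)
    pc = F.normalize(pc_tokens, dim=-1)
    logits = img[pos_pairs[:, 0]] @ pc.T / tau   # similarity of each anchor to all pc tokens
    targets = pos_pairs[:, 1]                    # the positionally matched token is the positive
    return F.cross_entropy(logits, targets)
```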
📊Real-world Experiments
MLA achieves superior performance across six tasks, outperforming pi0 and SpatialVLA by an average of 12% and 24%, respectively.
Ablation study. We systematically analyze the contributions of each component in the MLA model.
Generalization experiments. Visualization of the designed scenarios and quantitative results. The four scenarios involve unseen objects and unseen complex backgrounds, with red boxes highlighting the differences from the original setting.
a) Impact of Input Modalities and Alignment Strategies in the Encoder-Free Multimodal Alignment Scheme. As shown in Figure (left, a), we first examine the role of different input modalities and alignment strategies under the following configurations: (Ex1) 2D image input only, (Ex2) 2D image + 3D point cloud with simple token-level concatenation, (Ex3) 2D image + 3D point cloud + tactile signals with simple token-level concatenation, (Ex4) all modalities with image-level contrastive alignment, and (Ex5) all modalities with our proposed token-level contrastive alignment.
b) Impact of Contrastive Loss Position. As shown in Figure (left, b), we investigate the effect of applying the contrastive loss at different layers of the LLaMA-2 backbone. Specifically, we select the 4th, 8th, 12th, and 32nd layers for cross-modal alignment during the SFT and post-training stages.
c) Impact of Different Generation Modalities in Future State Generation. As shown in Figure (left, c), building upon the MLA model after SFT, we further evaluate three ablation variants during post-training: (1) without image generation, (2) without point cloud generation, and (3) without tactile signal generation; an illustrative sketch of such per-modality generation heads is given below.
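For concreteness, below is a hedged sketch of the future-state generation heads ablated above: lightweight transformer decoders that cross-attend to the LLM’s final-layer features and predict future image, point cloud, and tactile targets. Module names, dimensions, and output formats are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class FutureStateDecoder(nn.Module):
    """Lightweight transformer decoder head that maps LLM final-layer features
    to a future-state prediction for a single modality (illustrative sketch)."""
    def __init__(self, llm_dim=4096, d_model=512, out_dim=768, n_queries=64):
        super().__init__()
        self.proj = nn.Linear(llm_dim, d_model)
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, out_dim)

    def forward(self, llm_feats):                 # llm_feats: (B, T, llm_dim)
        memory = self.proj(llm_feats)             # (B, T, d_model)
        q = self.queries.unsqueeze(0).expand(llm_feats.size(0), -1, -1)
        out = self.decoder(q, memory)             # learnable queries cross-attend to LLM features
        return self.head(out)                     # (B, n_queries, out_dim)

# One decoder per modality; post-training sums their reconstruction losses
# against the observed future 2D, 3D, and tactile states.
future_decoders = nn.ModuleDict({
    "image":   FutureStateDecoder(out_dim=768),  # e.g. future image patch targets
    "points":  FutureStateDecoder(out_dim=3),    # e.g. future 3D point coordinates
    "tactile": FutureStateDecoder(out_dim=6),    # e.g. future tactile readings
})
```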
🌟Demonstrations
wiping a whiteboard with an eraser (easy)
wiping a whiteboard with an eraser (hard)
placing a dish on a rack
pressing a stamp onto paper (easy)
pressing a stamp onto paper (hard)
placing an egg on bread with a spatula
opening a pot lid and picking corn from the pot
scooping popcorn into a bowl
placing an egg on bread with a spatula (change the color of the target plate)
placing an egg on bread with a spatula (replace the egg with lemon)
placing an egg on bread with a spatula (complex background 1)
placing an egg on bread with a spatula (complex background 2)
We present the inference and execution processes of the MLA model across multiple tasks. MLA accurately infers real-world actions and maintains stable interactions with diverse objects. Note that, since the Franka arm operates under force control, we adopt a slower execution speed for tasks that are particularly sensitive to tactile feedback to ensure stability.
🤖Setup Details
For single-arm tasks, we employ a Franka Research 3 robotic arm equipped with a ROBOTIQ adaptive gripper as the end-effector. Visual observations are provided by two Intel RealSense D455 cameras, one positioned at a right-front third-person viewpoint and the other mounted on the wrist. In addition, two Tashan TS-E-A tactile sensors are attached to the fingertips of the gripper to capture tactile feedback.
For dual-arm tasks, we utilize two parallel Franka Emika arms with the same end-effector configuration. The observation setup includes an additional front-facing RealSense D455 camera along with two wrist-mounted cameras, ensuring comprehensive multi-view perception.
📃Pre-training Data
To ensure the quality and consistency of training data, we curated 28 high-quality datasets from the Open-X-Embodiment, DROID, and RoboMIND collections and applied customized sampling ratios, resulting in a total of 570K trajectories and 36M frames (see the table above for details). The action representations across datasets were unified to align with those used in the fine-tuning stage, thereby maximizing the utility of pretraining. During pretraining, since these datasets only provide 2D image observations, we restricted the input modalities to 2D RGB images, language instructions, and robot states, while the token slots corresponding to 3D point clouds and tactile signals were reserved as empty tokens, ensuring consistent input sequences between pretraining and fine-tuning. In the fine-tuning stage, we further incorporated multi-view images, which were encoded through a shared tokenizer and concatenated sequentially after the single-view image tokens.
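As a rough illustration of this sequence-consistency design, the sketch below pads the reserved point cloud and tactile slots with learnable placeholder tokens when those modalities are unavailable during pretraining; the slot sizes, names, and overall layout are assumptions, not the released data pipeline.

```python
import torch
import torch.nn as nn

class MultisensorySequenceBuilder(nn.Module):
    """Assembles a fixed-layout token sequence; missing modalities are filled
    with learnable empty placeholder tokens (illustrative assumption)."""
    def __init__(self, d_model=4096, n_pc_slots=256, n_tactile_slots=32):
        super().__init__()
        self.empty_pc = nn.Parameter(torch.zeros(n_pc_slots, d_model))
        self.empty_tactile = nn.Parameter(torch.zeros(n_tactile_slots, d_model))

    def forward(self, img_tokens, lang_tokens, state_tokens,
                pc_tokens=None, tactile_tokens=None):
        B = img_tokens.size(0)
        # Pretraining data has 2D images only, so the 3D and tactile slots fall
        # back to the placeholders; fine-tuning supplies real tokens instead.
        pc = pc_tokens if pc_tokens is not None else \
            self.empty_pc.unsqueeze(0).expand(B, -1, -1)
        tac = tactile_tokens if tactile_tokens is not None else \
            self.empty_tactile.unsqueeze(0).expand(B, -1, -1)
        return torch.cat([img_tokens, pc, tac, lang_tokens, state_tokens], dim=1)
```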
🏆Simulation Experiments
Visualization of our future image generation results during post-training.
Visualization of our future point cloud generation results during post-training.