MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation

*Equal Contribution · Project Leader · Corresponding Author
1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University · 2 Beijing Innovation Center of Humanoid Robotics · 3 CUHK

Abstract

Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. However, most VLA models focus on interpreting vision and language to generate actions, whereas robots must also perceive and interact within the spatial-physical world. This gap highlights the need for a comprehensive understanding of robot-specific multisensory information, which is crucial for complex, contact-rich control. To this end, we introduce a multisensory language-action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling. Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence. To further strengthen MLA's understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation. In evaluation, MLA outperforms the previous state-of-the-art 2D and 3D VLA methods by 12% and 24% on complex, contact-rich real-world tasks, respectively, while also demonstrating improved generalization to unseen configurations.
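To make the alignment idea above concrete, the following is a minimal sketch of how 2D image patches, 3D point-cloud groups, and tactile readings could be projected into a shared token space and tied together through a shared positional index before being handed to the language model. It is an illustration only: the projection layers, the patch/group sizes, and the class name MultisensoryTokenizer are assumptions for this sketch, not the authors' implementation.

# A minimal, illustrative sketch of the encoder-free multisensory alignment idea
# described in the abstract. NOT the authors' implementation: patch sizes, the
# linear projections, and how positional correspondence is shared across
# modalities are assumptions made purely for illustration.

import torch
import torch.nn as nn


class MultisensoryTokenizer(nn.Module):
    """Projects 2D image patches, 3D point groups, and tactile readings into the
    LLM embedding space, tying spatially corresponding tokens to the same
    positional embedding (the "positional correspondence" alignment)."""

    def __init__(self, d_model=1024, num_positions=256,
                 patch_dim=3 * 16 * 16, point_dim=6 * 32, tactile_dim=64):
        super().__init__()
        self.img_proj = nn.Linear(patch_dim, d_model)       # flattened 16x16 RGB patches
        self.pcd_proj = nn.Linear(point_dim, d_model)        # groups of 32 xyz+rgb points
        self.tac_proj = nn.Linear(tactile_dim, d_model)      # per-fingertip tactile vector
        self.pos_emb = nn.Embedding(num_positions, d_model)  # shared across all modalities

    def forward(self, img_patches, pcd_groups, tactile, img_pos, pcd_pos, tac_pos):
        # Each *_pos tensor holds the index of the spatial cell a token belongs to,
        # so an image patch and the point group it overlaps share one embedding.
        img_tok = self.img_proj(img_patches) + self.pos_emb(img_pos)
        pcd_tok = self.pcd_proj(pcd_groups) + self.pos_emb(pcd_pos)
        tac_tok = self.tac_proj(tactile) + self.pos_emb(tac_pos)
        # Concatenate and hand everything to the LLM backbone, which then acts as
        # the perception module directly (no separate per-modality encoders).
        return torch.cat([img_tok, pcd_tok, tac_tok], dim=1)


# Toy usage: batch of 1, 196 image patches, 128 point groups, 2 tactile pads.
tok = MultisensoryTokenizer()
tokens = tok(
    torch.randn(1, 196, 3 * 16 * 16), torch.randn(1, 128, 6 * 32), torch.randn(1, 2, 64),
    torch.randint(0, 256, (1, 196)), torch.randint(0, 256, (1, 128)), torch.randint(0, 256, (1, 2)),
)
print(tokens.shape)  # (1, 326, 1024) -> fed to the language model as one token sequence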

Method

Demonstrations

All tasks are trained and tested with keyframes; a keyframe-selection sketch follows the task list below.

Franka Emika Panda Tabletop Manipulation

Wipe the whiteboard

Wipe the whiteboard (hard)

Put dish on rack

Press stamp on paper

Place egg on bread

Place egg on bread (background)

Place egg on bread (object)

Scoop popcorn into a bowl

Open pot and pick corn
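The keyframe note above means policies are supervised on a sparse set of informative frames rather than on every timestep. The exact selection criterion is not stated on this page; the sketch below uses a common heuristic from RLBench-style pipelines (gripper-state changes or near-zero joint velocity) purely as an illustration.

# A minimal keyframe-selection sketch. The criterion actually used by MLA is not
# given here; the heuristic below is an assumption borrowed from common
# RLBench-based pipelines.

import numpy as np


def select_keyframes(gripper_open, joint_velocities, vel_eps=1e-2):
    """Return indices of keyframes in a demonstration.

    gripper_open:     (T,) array of 0/1 gripper states per timestep.
    joint_velocities: (T, J) array of joint velocities per timestep.
    """
    T = len(gripper_open)
    keyframes = []
    for t in range(1, T):
        gripper_changed = gripper_open[t] != gripper_open[t - 1]
        arm_stopped = np.abs(joint_velocities[t]).max() < vel_eps
        if gripper_changed or arm_stopped:
            keyframes.append(t)
    if not keyframes or keyframes[-1] != T - 1:
        keyframes.append(T - 1)  # always keep the final frame as a goal state
    return keyframes


# Toy usage with a 10-step demo and 7 joints.
demo_grip = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 1])
demo_vel = np.random.uniform(0.1, 0.5, size=(10, 7))
print(select_keyframes(demo_grip, demo_vel))  # e.g. [3, 7, 9]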


Real-world Setup


For single-arm tasks, we employ a Franka Research 3 robotic arm equipped with a Robotiq adaptive gripper as the end-effector. Visual observations are provided by two Intel RealSense D455 cameras: one positioned at a right-front, third-person viewpoint and the other mounted on the wrist. In addition, two Tashan TS-E-A tactile sensors are attached to the gripper fingertips to capture tactile feedback. For dual-arm tasks, we use two Franka Emika arms arranged in parallel with the same end-effector configuration; the observation setup adds a front-facing RealSense D455 camera alongside the two wrist-mounted cameras, ensuring comprehensive multi-view perception.
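For reference, the observation layout described above can be written down as a plain configuration structure. The key names and grouping below are illustrative assumptions; only the sensor models and mount points come from the text.

# Observation configuration implied by the setup above, written as plain Python
# dicts for reference. Key names are illustrative; sensor models and mount
# points follow the description in the paragraph above.

SINGLE_ARM_OBS = {
    "arm": "Franka Research 3",
    "gripper": "Robotiq adaptive gripper",
    "cameras": {
        "third_person": {"model": "Intel RealSense D455", "mount": "right-front viewpoint"},
        "wrist": {"model": "Intel RealSense D455", "mount": "wrist"},
    },
    "tactile": {
        "left_fingertip": "Tashan TS-E-A",
        "right_fingertip": "Tashan TS-E-A",
    },
}

DUAL_ARM_OBS = {
    "arms": ["Franka Emika (left)", "Franka Emika (right)"],
    # Same end-effector configuration as the single-arm setup (gripper + fingertip tactile).
    "end_effector": SINGLE_ARM_OBS["gripper"],
    "cameras": {
        "front": {"model": "Intel RealSense D455", "mount": "front-facing"},
        "left_wrist": {"model": "Intel RealSense D455", "mount": "left wrist"},
        "right_wrist": {"model": "Intel RealSense D455", "mount": "right wrist"},
    },
}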

Real-world Experiments


To systematically evaluate our model, we design six complex, contact-rich real-world experiments covering both single-arm and dual-arm manipulation tasks, on which MLA achieves state-of-the-art success rates and demonstrates strong generalization to unseen objects and backgrounds.

RLBench Experiments


For reproducibility, we further evaluate MLA in the RLBench simulator, where it also achieves competitive performance.

BibTeX

@misc{liu2025mlamultisensorylanguageactionmodel,
      title={MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation}, 
      author={Zhuoyang Liu and Jiaming Liu and Jiadong Xu and Nuowei Han and Chenyang Gu and Hao Chen and Kaichen Zhou and Renrui Zhang and Kai Chin Hsieh and Kun Wu and Zhengping Che and Jian Tang and Shanghang Zhang},
      year={2025},
      eprint={2509.26642},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.26642}, 
}