UniJEPA
Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning
Jianke Zhang*, Yucheng Hu*, Yanjiang Guo, Xiaoyu Chen, Yichen Liu, Wenna Chen, Chaochao Lu, Jianyu Chen
Joint state prediction has garnered significant attention in vision-language-action models. However, prior approaches have predominantly focused on visual signal generation, while overlooking multimodal understanding and the dynamic characterization of high-dimensional features. Recent unified models of generation and understanding have demonstrated strong capabilities in both comprehension and generation through large-scale pretraining. We posit that robotic policy learning can likewise benefit from the combined strengths of observation understanding and continuous future representation learning. Building on this insight, we introduce UniJEPA, which learns to dynamically model high-dimensional visual features through pretraining on over 1M internet-scale instructional manipulation videos. UniJEPA is then fine-tuned on data collected from the robot embodiment, where it learns to map future state representations to action tokens. Extensive experiments show that our approach consistently outperforms baseline methods across both simulation environments and real-world tasks.
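To make the pretraining objective concrete, the sketch below illustrates a joint-embedding predictive setup of the kind described above: a predictor regresses the representation of a future frame, produced by an EMA target encoder, from the current observation and a language embedding, so learning happens entirely in latent space rather than in pixels. All module names, sizes, and the specific loss are illustrative assumptions, not the released UniJEPA implementation.

```python
# Minimal sketch of a JEPA-style future-representation objective (illustrative only).
# Module names and hyper-parameters are assumptions, not the released UniJEPA code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy visual encoder: flattens an image and projects it to a latent vector."""
    def __init__(self, in_dim=3 * 64 * 64, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, latent_dim))

    def forward(self, x):
        return self.net(x)

class Predictor(nn.Module):
    """Predicts the latent of a future frame from the current latent plus a language embedding."""
    def __init__(self, latent_dim=256, text_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim, 512), nn.GELU(), nn.Linear(512, latent_dim)
        )

    def forward(self, z_now, z_text):
        return self.net(torch.cat([z_now, z_text], dim=-1))

encoder = Encoder()
target_encoder = copy.deepcopy(encoder)          # EMA copy, never updated by gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = Predictor()
opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

def jepa_step(frame_now, frame_future, text_emb, ema=0.996):
    """One training step: predict the future frame's representation in latent space."""
    z_now = encoder(frame_now)
    with torch.no_grad():
        z_future = target_encoder(frame_future)  # prediction target, no pixel reconstruction
    z_pred = predictor(z_now, text_emb)
    loss = F.smooth_l1_loss(z_pred, z_future)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # EMA update of the target encoder
    with torch.no_grad():
        for p_t, p in zip(target_encoder.parameters(), encoder.parameters()):
            p_t.mul_(ema).add_(p, alpha=1 - ema)
    return loss.item()

# Dummy usage with random tensors standing in for video frames and a language embedding.
loss = jepa_step(torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64), torch.randn(8, 256))
```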
We present a two-stage framework that unifies continuous and discrete representation learning for robot policy learning. In the first stage, we train a unified vision-language joint embedding model across diverse manipulation datasets to harness physical knowledge from internet-scale data. In the second stage, we design networks that aggregate predictive visual representations and output robot actions. UniJEPA employs a Mixture-of-Transformers architecture with modality-specialized experts, enabling effective cross-modal learning.
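The snippet below is a minimal sketch of one Mixture-of-Transformers block under the assumption that attention is shared across the full multimodal token sequence while each modality is routed to its own feed-forward expert. The routing scheme, dimensions, and module names are chosen for exposition and are not the exact architecture.

```python
# Illustrative Mixture-of-Transformers block: attention is shared across modalities,
# while each modality (e.g. text vs. vision) is routed to its own feed-forward expert.
# Layer sizes, routing scheme, and module names are assumptions for exposition.
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_modalities=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality (0 = text tokens, 1 = vision tokens).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_modalities)
        )

    def forward(self, x, modality_ids):
        # Global self-attention over the full multimodal sequence.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Route every token to the feed-forward expert of its modality.
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m               # (batch, seq_len) boolean mask
            if mask.any():
                out[mask] = expert(h[mask])
        return x + out

# Dummy usage: 16 text tokens followed by 48 vision tokens in one sequence.
block = MoTBlock()
tokens = torch.randn(2, 64, 512)
modality_ids = torch.cat(
    [torch.zeros(2, 16, dtype=torch.long), torch.ones(2, 48, dtype=torch.long)], dim=1
)
out = block(tokens, modality_ids)   # (2, 64, 512)
```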
We conduct comprehensive experiments on both simulated and real-world robotic tasks to evaluate our approach. The simulated environments include the CALVIN benchmark and SimplerEnv benchmark, while the real-world tasks encompass Franka arm manipulation and XArm dexterous hand manipulation. Our method achieves state-of-the-art performance across all evaluation environments.
Unified Framework
A unified vision-language-action (VLA) pretraining framework that seamlessly integrates discrete language understanding and continuous visual prediction.
Two-Stage Learning
A novel two-stage training strategy that preserves vision-language capabilities while enabling effective action learning through future state prediction (a stage-two sketch follows below).
Superior Performance
State-of-the-art results across multiple benchmarks including SimplerEnv, CALVIN, and real-world robotic platforms.
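As referenced above, the following sketch shows what the second stage could look like: a small head aggregates predicted future-state representations coming out of the stage-one backbone and decodes discrete action tokens from them. The pooling, action-token vocabulary, and layer shapes are illustrative assumptions rather than the actual UniJEPA networks.

```python
# Sketch of the second stage only: a small head aggregates predicted future
# representations from the pretrained backbone and decodes discrete action tokens.
# The token vocabulary, pooling, and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    def __init__(self, latent_dim=256, n_action_tokens=7, vocab_size=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)                  # aggregate future-state tokens
        self.decode = nn.Linear(latent_dim, n_action_tokens * vocab_size)
        self.n_action_tokens, self.vocab_size = n_action_tokens, vocab_size

    def forward(self, future_reprs):
        # future_reprs: (batch, n_tokens, latent_dim) predicted future-state representations
        pooled = self.pool(future_reprs.transpose(1, 2)).squeeze(-1)   # (batch, latent_dim)
        logits = self.decode(pooled)
        return logits.view(-1, self.n_action_tokens, self.vocab_size)

head = ActionHead()
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

def stage2_step(future_reprs, action_token_targets):
    """Fine-tuning step on robot data: map future-state representations to action tokens."""
    logits = head(future_reprs)
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), action_token_targets.flatten())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Dummy usage: representations from a (frozen or jointly tuned) stage-one backbone.
loss = stage2_step(torch.randn(8, 16, 256), torch.randint(0, 256, (8, 7)))
```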
Our method achieves 71.0% success rate on SimplerEnv-WindowX and 78.4% success rate on SimplerEnv-Google Robot, establishing new state-of-the-art performance on both platforms. On the CALVIN ABC-D benchmark, our approach demonstrates superior long-horizon task completion capabilities, significantly outperforming existing baseline methods.
Simulation Results
On the SimplerEnv benchmark, our approach consistently surpasses existing state-of-the-art methods across multiple manipulation tasks.
Real-World Results
On both the Franka arm and XArm dexterous hand platforms, our method achieves the highest success rates across diverse manipulation tasks.
On both the Franka Emika and XArm dexterous hand platforms, we deploy the policy on a variety of tasks, including placing, cup-upright, relocating, stacking, passing, pressing, unplugging, and opening. We also roll out the policy on unseen objects, which are marked in blue.