Anonymous Authors
Vision-language-action (VLA) models have become effective generalist manipulation policies by reusing semantic priors from large pretrained vision-language backbones. However, their visual interfaces are typically built from 2D image tokens, leaving known camera intrinsics and extrinsics to be inferred indirectly from action supervision. This is a poor inductive bias for multi-camera robot systems, where views are geometrically coupled by calibration and where many manipulation failures arise from spatial ambiguity rather than semantic misunderstanding.
We propose G3VLA (G cubed VLA), a lightweight camera-aware visual-token pathway for pretrained VLAs. The method combines intrinsic-conditioned ray embeddings, projective positional encoding (PRoPE), and bidirectional cross-view fusion to inject calibrated geometry before action generation. It does not change the pretrained backbone, action representation, policy interface, or imitation objective. Geometry supervision is obtained from simulator point maps when available, or from confidence-gated π³X teacher predictions when only RGB and camera calibration are available.
Across LIBERO, RoboCasa24, RoboTwin2.0, and real-robot experiments, G3VLA improves most on object- and spatial-sensitive tasks. Results on π0, π0.5, and GR00T 1.5 further suggest that geometry transfer is strongest when geometry-aware tokens directly participate in the action generation pathway.
Geometry gap in VLA visual tokens
We identify a mismatch between 2D image-token representations and the calibrated spatial structure used by real robot camera rigs, especially under viewpoint and intrinsic shifts.
Backbone-preserving camera-aware module
G3VLA injects calibrated geometry through ray embeddings, PRoPE, and cross-view fusion. The base action objective is left unchanged; the only addition is a training-only geometry distillation regularizer.
Dense geometry supervision without depth at deployment
The geometric pathway is trained with point-map targets from simulator ground truth or π³X teacher predictions. The deployed policy still consumes only RGB, language, state, and calibration.
Multi-benchmark and multi-backbone validation
We evaluate on LIBERO, RoboCasa24, RoboTwin2.0, and real-robot camera-shift settings, and test transfer across π0, π0.5, and GR00T 1.5 style VLA architectures.
Given RGB observations, language, robot state, and calibrated camera parameters, G3VLA transforms visual patch tokens into geometry-aware tokens before they are consumed by the unchanged VLA action model.
For each patch center u, the normalized ray K⁻¹u provides a calibration-dependent viewing direction. A zero-initialized ray encoder adds this signal to the image token stream without disrupting the pretrained visual behavior at the start of fine-tuning.
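The ray computation and the effect of zero initialization can be illustrated with a minimal sketch. It assumes a pinhole camera without skew, scales the ray to unit length (one possible reading of "normalized"), and uses a single linear layer as the ray encoder. The function names, the token width, and the linear-layer choice are illustrative assumptions, not the paper's implementation.

```python
import math

def patch_ray(u, v, fx, fy, cx, cy):
    """Unit-length viewing ray K^{-1} [u, v, 1]^T for a skew-free pinhole camera."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)

def ray_encoder(ray, weight, bias):
    """Hypothetical linear ray encoder: maps a 3-vector to a token-sized offset."""
    return [sum(w * r for w, r in zip(row, ray)) + b for row, b in zip(weight, bias)]

dim = 4  # hypothetical token width, chosen only for the example
zero_weight = [[0.0] * 3 for _ in range(dim)]  # zero initialization
zero_bias = [0.0] * dim

token = [0.3, -1.2, 0.7, 0.0]  # stand-in for a pretrained patch token
ray = patch_ray(u=320.0, v=240.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
offset = ray_encoder(ray, zero_weight, zero_bias)
fused = [t + o for t, o in zip(token, offset)]  # identical to token at initialization
```

With all encoder weights at zero, the fused token equals the pretrained token, so the geometry signal only starts to influence the policy as the encoder is trained.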
PRoPE augments attention with camera-derived projective relations. Instead of treating cameras as independent image streams, attention receives a geometric bias computed from intrinsics and extrinsics.
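As a rough illustration of a camera-derived projective relation, the sketch below computes the relative projective transform between two cameras, P_i P_j⁻¹, where each P lifts the intrinsics to a 4x4 matrix and composes them with the extrinsics. It assumes skew-free intrinsics and world-to-camera extrinsics [R | t]; the helper names and the dictionary layout are hypothetical, and how such a matrix enters the attention computation in G3VLA is not specified in this text.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lift_intrinsics(fx, fy, cx, cy):
    """4x4 matrix with K in the upper-left block."""
    return [[fx, 0.0, cx, 0.0], [0.0, fy, cy, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

def lift_intrinsics_inv(fx, fy, cx, cy):
    """Closed-form inverse of the lifted skew-free intrinsics."""
    return [[1.0 / fx, 0.0, -cx / fx, 0.0], [0.0, 1.0 / fy, -cy / fy, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

def rigid(R, t):
    """4x4 world-to-camera transform [R | t]."""
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]

def rigid_inv(R, t):
    """Inverse rigid transform [R^T | -R^T t]."""
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    ti = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return rigid(Rt, ti)

def relative_projective(cam_i, cam_j):
    """P_i @ P_j^{-1}, with P = lift(K) @ [R | t]."""
    P_i = matmul(lift_intrinsics(*cam_i["K"]), rigid(cam_i["R"], cam_i["t"]))
    P_j_inv = matmul(rigid_inv(cam_j["R"], cam_j["t"]),
                     lift_intrinsics_inv(*cam_j["K"]))
    return matmul(P_i, P_j_inv)

eye3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cam_a = {"K": (500.0, 500.0, 320.0, 240.0), "R": eye3, "t": [0.0, 0.0, 0.0]}
cam_b = {"K": (500.0, 500.0, 320.0, 240.0), "R": eye3, "t": [0.1, 0.0, 0.0]}

rel_same = relative_projective(cam_a, cam_a)   # identity up to rounding
rel_shift = relative_projective(cam_b, cam_a)  # encodes the 0.1 offset along x
```

The relative form depends only on how the two cameras relate to each other, not on the choice of world frame, which is what makes it a natural quantity to condition attention between tokens from different views.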
Frame attention preserves within-view structure, while cross-view attention exchanges calibrated context across camera streams. The output remains a standard sequence of visual tokens for the action model.
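One way to picture the two attention patterns is through their masks over a flattened multi-camera token sequence. The sketch below is an assumption about the layout: tokens are ordered view by view, frame attention is restricted to same-view pairs, and cross-view attention is restricted to different-view pairs. The paper may instead use global attention for the cross-view step; the function names are hypothetical.

```python
def view_ids(num_views, patches_per_view):
    """Camera index of every token in a view-major flattened sequence."""
    return [v for v in range(num_views) for _ in range(patches_per_view)]

def frame_mask(ids):
    """True where query and key tokens come from the same camera."""
    return [[q == k for k in ids] for q in ids]

def cross_view_mask(ids):
    """True where query and key tokens come from different cameras."""
    return [[q != k for k in ids] for q in ids]

ids = view_ids(num_views=2, patches_per_view=2)  # [0, 0, 1, 1]
within = frame_mask(ids)
across = cross_view_mask(ids)
```

Both masks are symmetric, so information flows in both directions between any pair of views, and the sequence length is unchanged, consistent with the output remaining a standard visual token sequence.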
Stage 1 trains the newly introduced geometry modules and auxiliary point head. Stage 2 unfreezes the full policy and optimizes the original action objective with a weaker geometry regularizer.
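The two-stage schedule can be written down as a small configuration sketch. The parameter-group names and the geometry weights (1.0 and 0.1) are placeholders chosen for illustration, not values from the paper, and the assumption that both stages sum the action loss with a weighted geometry loss is likewise not confirmed by the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple        # parameter groups that receive gradients
    geometry_weight: float  # multiplier on the auxiliary point-map loss

# Placeholder group names and weights, for illustration only.
STAGES = (
    Stage("stage1", ("geometry_modules", "point_head"), 1.0),
    Stage("stage2",
          ("geometry_modules", "point_head", "vla_backbone", "action_head"),
          0.1),
)

def is_trainable(stage, group):
    """Whether a parameter group is updated in the given stage."""
    return group in stage.trainable

def total_loss(stage, action_loss, geometry_loss):
    """Action objective plus the stage-specific geometry regularizer."""
    return action_loss + stage.geometry_weight * geometry_loss
```

Training the new modules first lets them reach a useful operating point before gradients are allowed to move the pretrained policy weights.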
Supervision is provided either by ground-truth point maps in simulation or by confidence-gated π³X predictions for RGB-only datasets, avoiding depth sensors at deployment time.
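Confidence gating can be sketched as a masked regression loss: only points where the teacher is sufficiently confident contribute to the point-map error. The L1 distance, the threshold of 0.5, and the function name are assumptions for illustration; the text does not state the loss form or the gate used.

```python
def gated_point_loss(pred, target, confidence, threshold=0.5):
    """Mean L1 distance over 3D points whose teacher confidence passes the gate."""
    total, kept = 0.0, 0
    for p, t, c in zip(pred, target, confidence):
        if c >= threshold:
            total += sum(abs(pi - ti) for pi, ti in zip(p, t))
            kept += 1
    # With no confident points the term contributes nothing.
    return total / kept if kept else 0.0

pred = [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
teacher = [(0.0, 0.0, 2.0), (5.0, 5.0, 5.0)]

# Teacher supervision: the second point is unreliable and is gated out.
loss_teacher = gated_point_loss(pred, teacher, confidence=[0.9, 0.1])

# Simulator ground truth can be treated as fully confident.
loss_gt = gated_point_loss(pred, teacher, confidence=[1.0, 1.0])
```

The same function covers both supervision sources, since ground-truth point maps correspond to a confidence of one everywhere.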
Consistent gains on spatially demanding manipulation benchmarks.
LIBERO average gain for π0 with GT geometry supervision.
Largest suite-level gain on LIBERO-Object, where object localization is critical.
LIBERO average after transferring the camera-aware interface to π0.5.
On RoboTwin2.0, π³X distillation underperforms due to synthetic-domain teacher mismatch, while GT supervision improves success.
Camera-shift evaluation on a bimanual UR5 workbench.
The benchmark includes Pick-and-Place Test Tube and Pouring Nut. Both tasks require precise spatial grounding under changing context-camera poses.