Published in Transactions on Machine Learning Research (TMLR), April 2026
Code and models will be made available soon!
Task-accuracy radar comparing visuotactile models. ViTaPEs outperforms all others in robustness and cross-domain generalization.
ViTaPEs is a ViT-based architecture that integrates visual and tactile data using a novel and theoretically grounded positional encoding formulation for robust, task-agnostic visuotactile representation learning and zero-shot generalization.
Humans use sight and touch together to understand objects and interact with their environment.
In robotics, however, vision and tactile sensing are often handled separately, which can cause robots to miss important details or perceive their surroundings only partially.
ViTaPEs addresses this by learning a single spatial representation that combines visual and tactile data.
By integrating these two types of information from the beginning, robots can recognize materials more reliably, grasp objects more precisely, and adapt to new items without extensive retraining.
ViTaPEs architecture: The visual and tactile inputs are projected into separate token spaces, followed by the addition of modality-specific (green and orange) and shared global (purple) positional encodings for multi-modal fusion.
Learns fine-grained (local) and coarse (global) positional information for both image patches and tactile patches.
A cross-modal transformer backbone that jointly attends over visual and tactile features, enabling context-aware integration.
Trained on one set of cameras or tactile arrays, ViTaPEs can generalize to unseen devices without additional retraining.
For each modality (vision & touch), we first compute local positional encodings at the patch level.
Vision: We divide the RGB image into fixed-size patches and map them to visual tokens. A learned visual positional encoding is then added to each token, so the representation carries its within-stream spatial layout before cross-modal interaction.
Touch: We similarly discretize the tactile input into patches and map them to tactile tokens. A separate learned tactile positional encoding is added to each tactile token, preserving the spatial layout of the tactile stream.
Key Idea: By injecting modality-specific positional encodings within each stream, the model preserves where features come from locally. A global positional encoding is then added to the joint token sequence immediately before attention, providing shared positional signals when cross-modal interaction occurs.
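The snippet below is a minimal PyTorch sketch of the modality-specific local encodings described above: each stream gets its own patch tokenizer and its own learned positional table, which is simply added to the patch tokens. Class names, embedding size, image/patch sizes, and the three-channel tactile input are illustrative assumptions, not the released ViTaPEs code.

import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Split a (B, C, H, W) input into non-overlapping patches, embed them,
    and add a learned modality-specific (local) positional encoding."""
    def __init__(self, in_channels, embed_dim, img_size, patch_size):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.local_pe = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, D)
        return tokens + self.local_pe  # inject within-stream spatial layout

# Separate tokenizers keep the visual and tactile encodings modality-specific.
vision_tok = PatchTokenizer(in_channels=3, embed_dim=256, img_size=224, patch_size=16)
touch_tok = PatchTokenizer(in_channels=3, embed_dim=256, img_size=224, patch_size=16)

rgb = torch.randn(2, 3, 224, 224)   # RGB image batch
tac = torch.randn(2, 3, 224, 224)   # tactile batch (assumed image-like sensor output)
vis_tokens, tac_tokens = vision_tok(rgb), touch_tok(tac)

Keeping the two positional tables separate is what lets each stream encode its own spatial layout before any cross-modal mixing happens.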
While modality-specific encodings preserve within-stream spatial structure, ViTaPEs also adds a learned global positional encoding on the joint token sequence immediately before self-attention. This provides shared positional signals at the stage where cross-modal interaction occurs.
Method: After adding local positional encodings within each modality, the visual and tactile token sequences are concatenated, passed through a shared token-wise non-linear projection head, and then augmented with a single learned global positional encoding before entering the transformer.
Fusion: The global positional encoding is not computed from a second coordinate-based MLP, and it is not concatenated with each patch feature as a separate “global vector.” Instead, it is added to the joint sequence immediately before attention, complementing the local positional encodings already injected within each modality.
Benefit: This two-stage design lets the model preserve within-modality geometry while also supplying shared positional signals when visual and tactile tokens interact, supporting cross-modal correspondence learning without assuming a geometrically calibrated common frame.
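Continuing the sketch above, the fusion stage could look roughly as follows: the two token streams are concatenated, passed through a shared token-wise non-linear projection, and a single learned global positional encoding is added (not concatenated) right before attention. The two-layer MLP and all sizes are assumptions for illustration, not the paper's exact configuration.

import torch
import torch.nn as nn

embed_dim = 256

# Shared token-wise non-linear projection applied to every token in the joint sequence.
shared_proj = nn.Sequential(
    nn.Linear(embed_dim, embed_dim),
    nn.GELU(),
    nn.Linear(embed_dim, embed_dim),
)

# One learned global positional encoding spanning the full visual + tactile sequence.
num_tokens = vis_tokens.shape[1] + tac_tokens.shape[1]
global_pe = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))

joint = torch.cat([vis_tokens, tac_tokens], dim=1)  # (B, N_v + N_t, D)
joint = shared_proj(joint)                          # same projection for both modalities
joint = joint + global_pe                           # added to the joint sequence before self-attention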
Once local positional encodings have been added within each modality, the visual and tactile patch tokens are concatenated into a single sequence and passed through a shared token-wise non-linear projection head before entering the transformer.
Self-attention: Within the transformer, self-attention operates on the joint visual–tactile token sequence, allowing the model to capture both within-modality structure and cross-modal dependencies between visual and tactile tokens.
Interaction mechanism: A global positional encoding is added on the joint token sequence immediately before attention, providing shared positional signals at the stage where cross-modal interaction occurs. This helps the model learn correspondences between what is seen and what is felt without assuming a geometrically calibrated common frame.
Outcome: Rather than relying on theoretical guarantees about preventing collapse, the final claim is empirical: this two-stage positional design improves representation learning, transfer, and grasp-success prediction across multiple real-world visuotactile benchmarks.
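To make the attention stage concrete, here is a minimal stand-in for the backbone: a standard pre-norm transformer encoder applied to the fused sequence from the previous snippet. Depth, head count, pooling, and the classification head are illustrative assumptions, not the architecture reported in the paper.

import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=embed_dim, nhead=8, dim_feedforward=4 * embed_dim,
    batch_first=True, norm_first=True,
)
backbone = nn.TransformerEncoder(encoder_layer, num_layers=6)

fused = backbone(joint)           # every token attends to both visual and tactile tokens
pooled = fused.mean(dim=1)        # simple mean pooling over the joint sequence
head = nn.Linear(embed_dim, 20)   # e.g., a hypothetical 20-way category probe
logits = head(pooled)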
We compare the top-1 accuracy of four models (Vanilla CNN, VTT, RoPE, and ViTaPEs) across four supervised visuotactile tasks: Category, Hardness, Texture, and OF Real object detection. Each model is represented by a distinct color (Vanilla CNN in blue, VTT in orange, RoPE in green, and ViTaPEs in red). The x-axis runs from 40% to 100% to emphasize performance differences, and ViTaPEs consistently achieves the highest accuracy on every task (80.1% for Category, 94.8% for Hardness, 89.7% for Texture, and 92.7% for OF Real), demonstrating its superiority over the baselines.
We show the accuracy of six models (TAG, SSVTP, MViTac, VTT, RoPE, and ViTaPEs) on four self-supervised tasks (Category, Hardness, Texture, and OF Real) as well as cross-sensor transfer on YCB. Each model has a unique, high-contrast color (TAG in purple, SSVTP in brown, MViTac in pink, VTT in orange, RoPE in green, and ViTaPEs in red). The x-axis spans 50% to 100%. ViTaPEs again leads all models on every metric (75.9% for Category, 92.2% for Hardness, 87.2% for Texture, 85.2% for OF Real, and 96.9% on YCB), underscoring its effectiveness even without supervision.
We compare six models (SSVTP, TAG, MViTac, VTT, RoPE, and ViTaPEs) on the Grasp dataset using three evaluation protocols: SSL, Linear, and Zero. The x-axis ranges from 50% to 75%, and each model is shown in a unique color (SSVTP in brown, TAG in purple, MViTac in pink, VTT in blue, RoPE in green, and ViTaPEs in red). ViTaPEs achieves the highest grasp-success accuracy under SSL (70.7%), remains competitive under linear probing (69.3%), and leads zero-shot evaluation (60.4%), demonstrating its robust generalization to unseen robotic grasping conditions.
We display linear-probe and zero-shot accuracy for five models (MViTac, UniTouch, VTT, RoPE, and ViTaPEs) on two dataset splits: TAG and OF Real. Groups of bars correspond to “Linear TAG,” “Linear OF Real,” “Zero TAG,” and “Zero OF Real,” spaced apart along the x-axis from 30% to 75%. Each model is colored distinctly (MViTac in pink, UniTouch in gray, VTT in blue, RoPE in green, and ViTaPEs in red). Missing values for UniTouch on “Linear TAG” and “Zero TAG” are labeled “N/A.” ViTaPEs outperforms all other models in both linear (53.1% on TAG, 68.1% on OF Real) and zero-shot (53.8% on TAG, 65.2% on OF Real) settings, highlighting its strong cross-domain transferability.
This table illustrates the impact of different positional encoding (PE) configurations on model accuracy for the TAG Category classification task. Each row corresponds to a specific ablation variant, showing whether Visual PE, Tactile PE, and Global PE were used as Learned, Sinusoidal, or None. The final column reports the resulting classification accuracy (%). The ViTaPEs row, highlighted for emphasis, represents the full model using learned encodings for all components and achieves the highest performance at 80.1%. Removing or altering individual PE components leads to notable accuracy drops, particularly when entire modalities are excluded (e.g., "Only Tactile", "Only Vision"), underscoring the importance of all three positional encoding streams for optimal visuotactile representation learning.
@article{lygerakis2026vitapes,
title={ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers},
author={Fotios Lygerakis and Ozan {\"O}zdenizci and Elmar Rueckert},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2026},
url={https://openreview.net/forum?id=mxzzO66Zbu}
}