Paper under review for NeurIPS 2025.
Preprint, code, and models will be made available upon acceptance.
Task-accuracy radar comparing visuotactile models. ViTaPEs outperforms all others in robustness and cross-domain generalization.
ViTaPEs is a ViT-based framework that integrates visual and tactile data using a novel and theoretically grounded positional encoding formulation for robust, task-agnostic visuotactile representation learning and zero-shot generalization.
Humans seamlessly combine sight and touch to understand objects and interact with their environment.
In robotics, however, vision and tactile sensing are often handled separately, which can cause robots to miss important details or perceive their environment only partially.
ViTaPEs addresses this by learning a single spatial representation that combines visual and tactile data.
By integrating these two types of information from the beginning, robots can recognize materials more reliably, grasp objects more precisely, and adapt to new items without extensive retraining.
ViTaPEs framework: The visual and tactile inputs are projected into separate token spaces, followed by the addition of modality-specific PEs (green and orange) and a shared global PE (purple) for multi-modal fusion.
Learns fine-grained (local) and coarse (global) positional information for both image patches and tactile patches.
Theoretical guarantees ensure that different spatial configurations map to distinct embeddings, and that spatial transformations (e.g., rotations, translations) produce predictable changes in the representation.
A cross-modal transformer backbone that jointly attends over visual and tactile features, enabling context-aware integration.
Trained on one set of cameras or tactile arrays, ViTaPEs generalizes to unseen devices without retraining.
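Schematically, the theoretical guarantees above can be stated as follows (a hedged formalization with assumed notation, not the paper's exact theorem): writing φ for the combined positional-encoding map, T for a rigid transform of the input coordinates, and ρ(T) for its induced action on the embedding space,

```latex
% Injectivity: distinct spatial configurations receive distinct codes.
p \neq q \;\Longrightarrow\; \phi(p) \neq \phi(q)

% Equivariance: a rigid transform of the inputs shifts the embedding predictably.
\phi\!\left(T(p)\right) = \rho(T)\,\phi(p)
```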
For each modality (vision & touch), we first compute local positional embeddings at the patch level.
Vision: Divide an RGB image into fixed-size patches. Each patch’s 2D pixel coordinates are projected through a small MLP to produce a position vector. These position vectors are then added to the patch’s feature (e.g., CNN or ViT embedding).
Touch: Similarly, the tactile sensor’s contact data is discretized into patches (e.g., small taxel regions). Each tactile patch’s physical location on the sensor array is encoded via a separate MLP. The resulting “touch position” vectors are summed with the corresponding tactile feature vectors.
Key Idea: By injecting explicit 2D positional information early, each modality’s features become aware of their spatial origin, which is crucial for cross-modal alignment downstream.
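A minimal PyTorch-style sketch of this step. The module name `LocalPE`, the hidden width, the 768-dimensional embeddings, and the stand-in coordinates are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn


class LocalPE(nn.Module):
    """Small MLP that maps a patch's 2D coordinates to a positional vector (illustrative sketch)."""

    def __init__(self, embed_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (num_patches, 2) patch-centre coordinates in the native sensor frame
        return self.mlp(coords)


# One encoder per modality, so vision and touch keep separate local position spaces.
vision_pe = LocalPE(embed_dim=768)
touch_pe = LocalPE(embed_dim=768)

# Stand-in patch features and coordinates (real features would come from the tokenizers).
vision_tokens = torch.randn(196, 768)            # 14 x 14 image patches
vision_coords = torch.rand(196, 2) * 224.0       # pixel coordinates on a 224 x 224 image
touch_tokens = torch.randn(64, 768)              # 8 x 8 taxel regions
touch_coords = torch.rand(64, 2) * 16.0          # taxel coordinates on the sensor pad

# Inject local spatial information by adding each position vector to its patch feature.
vision_tokens = vision_tokens + vision_pe(vision_coords)
touch_tokens = touch_tokens + touch_pe(touch_coords)
```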
While unimodal encodings capture local spatial context, we also need a notion of “global” layout to relate distant patches across both modalities.
Method: Compute a second set of embeddings that reflect the patch’s offset relative to the entire sensing field (e.g., top-left vs. bottom-right of the image or sensor). A separate MLP processes normalized (x, y) coordinates, producing global positional vectors.
Fusion: The unimodal (local) and global positional vectors are concatenated or summed with each patch’s feature before entering the transformer. This two-tier structure ensures that ViTaPEs can distinguish patches by both local neighborhood and overall location, which is critical for tasks like object shape inference or complex grasp planning.
Benefit: Global encodings guarantee that widely separated patches, such as a corner of the image and the opposite corner of the tactile array, are still placed in a coherent spatial frame.
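A sketch of the global encoding and its fusion with the token features, under the same assumptions as the previous snippet; summation (rather than concatenation) and the `GlobalPE` name are illustrative choices:

```python
import torch
import torch.nn as nn


class GlobalPE(nn.Module):
    """MLP that encodes a patch's normalized (x, y) offset within the whole sensing field."""

    def __init__(self, embed_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, norm_coords: torch.Tensor) -> torch.Tensor:
        # norm_coords: (num_patches, 2), each coordinate normalized to [0, 1]
        return self.mlp(norm_coords)


global_pe = GlobalPE(embed_dim=768)   # one shared encoder used by both modalities

# Normalized patch-centre offsets over the full image / tactile pad (stand-in values).
vision_norm = torch.rand(196, 2)
touch_norm = torch.rand(64, 2)

# Patch features that already carry their local PEs (see the previous sketch).
vision_tokens = torch.randn(196, 768)
touch_tokens = torch.randn(64, 768)

# Two-tier fusion: add the shared global PE on top of the local one before the transformer.
vision_tokens = vision_tokens + global_pe(vision_norm)
touch_tokens = touch_tokens + global_pe(touch_norm)
```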
Once each patch (vision or touch) carries both local and global positional information, we concatenate all patch tokens into a single sequence and feed them into a cross-modal transformer.
Cross-Attention Layers: At each transformer block, visual patches can attend to tactile patches and vice versa. This mutual attention enables the model to ground touch signals in the corresponding visual context (e.g., texture seen vs. texture felt).
Theoretical Properties: We prove that, under mild assumptions on the MLP architectures (e.g., universality), the combined positional embedding function is injective—meaning no two different spatial configurations map to the same ViTaPEs code. Moreover, small rigid transformations (translations, rotations) of the object will produce predictable, equivariant shifts in the embedding space.
Outcome: These guarantees ensure that the model cannot “collapse” distinct spatial arrangements into a single representation, leading to more reliable generalization—particularly when dealing with novel object poses or sensor displacements.
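A minimal sketch of the joint attention stage, using PyTorch's off-the-shelf transformer encoder as a stand-in for the ViTaPEs backbone; tensor shapes, depth, head count, and the mean-pooled readout are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Per-modality tokens that already carry local + global PEs (stand-in tensors here).
batch, d = 2, 768
vision_tokens = torch.randn(batch, 196, d)
touch_tokens = torch.randn(batch, 64, d)

# Concatenate both modalities into one sequence; self-attention over the joint sequence
# lets every visual patch attend to every tactile patch and vice versa.
tokens = torch.cat([vision_tokens, touch_tokens], dim=1)       # (batch, 260, d)

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d, nhead=8, dim_feedforward=4 * d,
    batch_first=True, norm_first=True,
)
fusion_transformer = nn.TransformerEncoder(encoder_layer, num_layers=6)

fused = fusion_transformer(tokens)                             # (batch, 260, d)

# Pool into a single visuotactile representation for downstream heads
# (category / hardness / texture classification, grasp prediction, ...).
representation = fused.mean(dim=1)                             # (batch, d)
```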
We compare the top-1 accuracy of four models (Vanilla CNN, VTT, RoPE, and ViTaPEs) across four supervised visuotactile tasks: Category, Hardness, Texture, and OF Real object detection. Each model is represented by a distinct color (Vanilla CNN in blue, VTT in orange, RoPE in green, and ViTaPEs in red). The x-axis runs from 40 % to 100 % to emphasize performance differences, and ViTaPEs consistently achieves the highest accuracy on every task (80.1 % for Category, 94.8 % for Hardness, 89.7 % for Texture, and 92.7 % for OF Real), demonstrating its superiority over the baselines.
We show the accuracy of six models (TAG, SSVTP, MViTac, VTT, RoPE, and ViTaPEs) on four self-supervised tasks (Category, Hardness, Texture, and OF Real) as well as cross-sensor transfer on YCB. Each model has a unique, high-contrast color (TAG in purple, SSVTP in brown, MViTac in pink, VTT in orange, RoPE in green, and ViTaPEs in red). The x-axis spans 50 % to 100 %. ViTaPEs again leads all models on every metric (75.0 % for Category, 92.2 % for Hardness, 87.2 % for Texture, 85.2 % for OF Real, and 96.9 % on YCB), underscoring its effectiveness even without supervision.
We compare six models (SSVTP, TAG, MViTac, VTT, RoPE, and ViTaPEs) on the Grasp dataset using three evaluation protocols: SSL, Linear, and Zero. The x-axis ranges from 50 % to 75 %, and each model is shown in a unique color (SSVTP in brown, TAG in purple, MViTac in pink, VTT in blue, RoPE in green, and ViTaPEs in red). ViTaPEs achieves the highest grasp-success accuracy under SSL (70.7 %), remains competitive under linear probing (69.3 %), and leads zero-shot evaluation (60.4 %), demonstrating its robust generalization to unseen robotic grasping conditions.
We display linear-probe and zero-shot accuracy for five models (MViTac, UniTouch, VTT, RoPE, and ViTaPEs) on two dataset splits: TAG and OF Real. Groups of bars correspond to “Linear TAG,” “Linear OF-Real,” “Zero TAG,” and “Zero OF-Real,” spaced apart along the x-axis from 30 % to 75 %. Each model is colored distinctly (MViTac in pink, UniTouch in gray, VTT in blue, RoPE in green, and ViTaPEs in red). Missing values for UniTouch on “Linear TAG” and “Zero TAG” are labeled “N/A.” ViTaPEs outperforms all other models in both linear (53.1 % on TAG, 68.1 % on OF-Real) and zero-shot (53.8 % on TAG, 65.2 % on OF-Real) settings, highlighting its strong cross-domain transferability.
This table illustrates the impact of different positional encoding (PE) configurations on model accuracy for the TAG Category classification task. Each row corresponds to a specific ablation variant, showing whether Visual PE, Tactile PE, and Global PE were used as Learned, Sinusoidal, or None. The final column reports the resulting classification accuracy (%). The ViTaPEs row, highlighted for emphasis, represents the full model using learned encodings for all components and achieves the highest performance at 80.1%. Removing or altering individual PE components leads to notable accuracy drops, particularly when entire modalities are excluded (e.g., "Only Tactile", "Only Vision"), underscoring the importance of all three positional encoding streams for optimal visuotactile representation learning.