We introduce STRING: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings (RoPE), a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING provides exact translation invariance, even for token coordinates of arbitrary dimensionality, while maintaining a low computational footprint.
These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Transformer-based vision encoders with RGB(-D) inputs (color plus optional depth), demonstrating substantial downstream gains. These include improvements to open-vocabulary object detection models and diffusion-based robotic controllers. We complement our extensive experiments, both robot-based and more general, with mathematical analysis.
Formally, STRING rotation matrices are defined as matrix exponentials of linear combinations of skew-symmetric generator matrices (see: bottom-left block). Importantly, those generators commute and are learnable. Using that definition, it is easy to see why RoPE is a special case of STRING: for RoPE, the generators are block-diagonal matrices with 2x2 skew-symmetric blocks. In STRING, the generators no longer have this rigid structure. We introduce two variants of STRING which outperform RoPE.
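To make the construction concrete, here is a minimal NumPy/SciPy sketch (the helper names `string_rotation` and `rope_generator` are ours, not from the official code): a STRING rotation is the matrix exponential of a coordinate-weighted sum of commuting skew-symmetric generators, and RoPE is recovered as the special case of a block-diagonal generator with 2x2 skew-symmetric blocks.

```python
import numpy as np
from scipy.linalg import expm

def string_rotation(coords, generators):
    """General STRING rotation (illustrative sketch).

    coords:     (c,) coordinate vector of a token (1D, 2D, 3D, ...).
    generators: (c, d, d) learnable, mutually commuting skew-symmetric matrices.
    Returns the (d, d) orthogonal matrix exp(sum_i coords[i] * generators[i]).
    """
    S = np.tensordot(coords, generators, axes=1)  # linear combination of generators
    return expm(S)                                # matrix exponential -> rotation

def rope_generator(freqs):
    """RoPE as a special case: a block-diagonal generator with one 2x2
    skew-symmetric block (of frequency theta) per pair of embedding dimensions."""
    d = 2 * len(freqs)
    G = np.zeros((d, d))
    for i, theta in enumerate(freqs):
        G[2 * i, 2 * i + 1] = -theta
        G[2 * i + 1, 2 * i] = theta
    return G

# Translation invariance hinges on the generators commuting:
# string_rotation(x, G).T @ string_rotation(y, G) == string_rotation(y - x, G)
# (up to numerical error), so attention only ever sees relative coordinates y - x.
```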
Cayley-STRING
We left- and right-multiply RoPE matrices by a learnable orthogonal matrix P and its transpose, respectively. This allows the optimizer to learn an arbitrary rotation instead of the restricted rotation used in RoPE. Note that:
- the first P term can be discarded, since it cancels out in the query-key dot product;
- Pᵀ can be folded into the query/key projections W_q/W_k during inference, resulting in the same inference cost as RoPE;
- since P does not store any positional information, it can be shared across examples and does not need to be generated by the computationally expensive matrix exponential.
To construct P, we use the Cayley transform, which can be implemented through linear solvers; see the sketch below.
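Below is a rough NumPy sketch of this variant (our own helper names, with a scalar 1D position for simplicity; not the official implementation): P comes from the Cayley transform of a learnable skew-symmetric matrix and wraps an ordinary RoPE rotation.

```python
import numpy as np

def cayley_orthogonal(A):
    """Cayley transform of a skew-symmetric matrix A (learnable in practice):
    P = (I + A)^{-1} (I - A), computed with a linear solve rather than an
    explicit inverse or a matrix exponential. P is orthogonal whenever A = -A^T."""
    I = np.eye(A.shape[0])
    return np.linalg.solve(I + A, I - A)

def rope_rotation(pos, freqs):
    """Standard RoPE for a scalar position: block-diagonal 2x2 rotations by pos * theta."""
    d = 2 * len(freqs)
    R = np.zeros((d, d))
    for i, theta in enumerate(freqs):
        c, s = np.cos(pos * theta), np.sin(pos * theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

def cayley_string_rotation(pos, freqs, A):
    """Cayley-STRING: R(pos) = P @ R_rope(pos) @ P.T with a learned orthogonal P.
    In the query-key dot product the outer P cancels, and P.T can be folded into
    the query/key projections at inference."""
    P = cayley_orthogonal(A)
    return P @ rope_rotation(pos, freqs) @ P.T
```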
Circulant-STRING
Circulant-STRING matrices can be re-written compactly as exponentiated skew-symmetric matrices S, where S is defined as the difference between a circulant matrix C (see: bottom-right) and its transpose. The first row of C is given by the dot products of the token's coordinate vector with a sequence of learnable vectors. Efficient computation of PEs with Circulant-STRING is guaranteed via the Fast Fourier Transform (FFT). Interestingly, both Cayley-STRING and Circulant-STRING can be written in the form used in the definition below, and thus are indeed STRINGs.
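A minimal NumPy sketch of this variant (the helper name and shapes are our assumptions): the first row of C is built from dot products of the token's coordinates with learnable vectors, S = C - Cᵀ is circulant and skew-symmetric, and its exponential is taken eigenvalue-wise in Fourier space, which is where the FFT speedup comes from.

```python
import numpy as np

def circulant_string_rotation(coords, V):
    """Circulant-STRING sketch.

    coords: (c,) coordinate vector of one token (e.g. its 2D/3D position).
    V:      (d, c) learnable vectors whose dot products with `coords` form
            the first row of the circulant matrix C.
    Returns exp(C - C^T), an orthogonal (d, d) rotation.
    """
    r = V @ coords                                   # first row of C, shape (d,)
    d = r.shape[0]
    idx = (np.arange(d)[None, :] - np.arange(d)[:, None]) % d
    C = r[idx]                                       # circulant matrix with C[0] == r
    S = C - C.T                                      # skew-symmetric and still circulant

    # Circulant matrices are diagonalized by the DFT, so the matrix exponential
    # reduces to exponentiating eigenvalues in Fourier space (the FFT shortcut);
    # the result equals scipy.linalg.expm(S) up to numerical error.
    lam = np.fft.fft(S[:, 0])                        # eigenvalues of S
    U = np.exp(2j * np.pi * np.outer(np.arange(d), np.arange(d)) / d) / np.sqrt(d)
    return (U @ np.diag(np.exp(lam)) @ U.conj().T).real
```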
Next, we show STRINGs in action for several robotics applications. In all of them, STRINGs are applied inside vision encoders in the corresponding controllers' backbones.
We evaluate the performance of STRING on dexterous robotics tasks using ALOHA 2, a bimanual parallel-jaw-gripper workcell with two 6-DoF arms, within the ALOHA Unleashed simulation environment. We trained ALOHA Unleashed's Transformer-based neural network with diffusion policies (conditioned on vision encoders) on human demonstrations of 12 dexterous tasks. The vision system utilized images from RGB cameras positioned overhead, on each wrist, and at the level of the table.
The evaluations from the best-performing checkpoints for each task using STRING with a SigLIP vision transformer are presented below. Note that a green tick indicates a successful run and a red cross indicates a failed run.
Additionally, the best-performing simulation evaluations for the task with the largest improvement and for the only regression with respect to the baseline SigLIP model are presented below.
Mean success rate (best, second-best) over 10 evaluations of each simulation task. RoPE and Cayley-STRING are added to a baseline SigLIP B/16 256 ViT.
Mean success rate across all tasks (MultiTask in table) evaluated 10 times every 10K train steps over 1M train steps.
Furthermore, we demonstrate STRING's capability for dexterous manipulation on real-world ALOHA robots. Qualitatively, we observed comparable performance on pick-and-place tasks. STRING also demonstrated an advantage in tasks where precise grasping is necessary, e.g. folding a dress.
Fold dress
Put the bowl into the drying rack
We evaluate the performance of STRING on robotics tasks using a bimanual KUKA IIWA setup. We trained policies using the energy-based parameterization of diffusion models. The trained policies use overhead cameras as well as wrist cameras. We use RGB images from both cameras, but depth images only from the overhead camera. For more details, please refer to the paper.
We also show qualitative results for our out-of-distribution evaluations. In the figures below, the top row shows failure scenarios for the 2D policies, while the bottom row shows successful evaluations of the STRING-based 3D policy.
3D STRING policies outperform 2D policies across all OOD settings. We hypothesize that the 3D STRING policy leverages the raw depth signal to generalize better to scene changes that are imperceptible to fixed monocular cameras.
Comparison of the 2D baseline with 3D STRING on out-of-distribution scenarios for real-world KUKA robot evaluations.
Implicit depth via normal maps
Depth input is used to construct a surface normal map with unit vectors per pixel. The RGB input and the normal map are then processed via identical (shared-weight) embedding layers. The embeddings are concatenated and processed through Transformer layers. Finally, the embeddings are split and fused to yield the final vision embedding. Our results below show that this approach to incorporating depth has a detrimental effect on the policy deployed on the robot. We hypothesize that this is caused by the significant amount of noise coming from the depth sensors, leading to imperfect surface normal maps; see the sketch below.
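For intuition, here is a minimal sketch of the normal-map step (a simple finite-difference approximation with a hypothetical helper name; the exact preprocessing may differ). It also makes clear why depth-sensor noise feeds directly into the normals.

```python
import numpy as np

def depth_to_normals(depth):
    """Per-pixel unit surface normals from a depth image (illustrative only).
    Any noise in `depth` shows up directly in the gradients and hence the normals."""
    dz_dy, dz_dx = np.gradient(depth)                        # image-space depth gradients
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], -1)  # (H, W, 3)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)     # normalize to unit length
```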
Lifting patches to 3D for STRING
We compute a height for each patch by mean-pooling the depth values of all pixels in the patch, followed by a learnable linear layer. The resulting 3D patches are then fed to the Transformer, with positional encodings given by STRING, thereby incorporating depth into the vision encoder. Our results show that STRING improves the success rate over the 2D base policy. Moreover, when STRING is combined with the first method, it drastically reduces the negative impact of noisy normal maps.
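A minimal sketch of the patch-lifting step (hypothetical helper and parameter names; the learnable linear layer is reduced to a scalar map w, b for brevity): each ViT patch gets an (x, y) grid coordinate plus a depth-derived z, and the resulting 3D coordinates are what STRING consumes.

```python
import numpy as np

def lift_patches_to_3d(depth, patch_size=16, w=1.0, b=0.0):
    """3D coordinates per ViT patch: (x, y) from the patch grid, z from
    mean-pooled patch depth passed through a linear map z = w * depth + b."""
    H, W = depth.shape
    gh, gw = H // patch_size, W // patch_size
    pooled = depth[:gh * patch_size, :gw * patch_size] \
        .reshape(gh, patch_size, gw, patch_size).mean(axis=(1, 3))
    ys, xs = np.meshgrid(np.arange(gh), np.arange(gw), indexing="ij")
    z = w * pooled + b
    return np.stack([xs, ys, z], axis=-1).reshape(-1, 3)  # (num_patches, 3) for STRING
```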
Performance of STRING with 3D input vs. baselines on real-robot tasks (with 2 seeds). The 2D baseline without depth input achieves ≈ 65%. Incorporating depth through surface normal maps (nmap) reduces performance to 42%. Using 3D STRING to incorporate depth improves performance in both scenarios: to 53% with normal maps and to 74% without. Mean/stdev shown above were calculated from 35 evaluation runs.
Image classification % test accuracy (best, second-best).
Next, we lifted WebLI, a dataset of 10 billion image-text pairs across a variety of languages, into 3D by pre-processing a 60-million-image subset with Depth-Anything-V2 for metric mono-depth estimation, and evaluated open-vocabulary detection and classification.
Image-to-text (i2t) and text-to-image (t2i) WebLI recall @ rank (best, second-best).
We evaluate the performance of STRING on the 3D detection task. Given an input image and a list of text strings describing objects in the scene, the model must output a 3D bounding box for each object. We base our implementation on OWL-ViT, which we modify to output the full SE(3) transform and size of each bounding box. We generated a synthetic dataset of 4 million images, 80 of which we held out for testing and the remainder we used to train our model.
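For illustration, assuming a standard oriented-box parameterization (a rotation, a translation, and three edge lengths; the paper's exact output head may differ), the predicted SE(3) transform and size determine the eight box corners that can then be compared to the ground truth, e.g. via 3D IoU:

```python
import numpy as np

def box_corners(R, t, size):
    """Eight corners of an oriented 3D box given its SE(3) pose (rotation R in SO(3),
    translation t) and edge lengths `size` (illustrative parameterization only)."""
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                                   for sy in (-1, 1)
                                   for sz in (-1, 1)])  # (8, 3), entries in {-1, +1}
    local = 0.5 * signs * size          # corners in the box's own frame
    return local @ R.T + t              # mapped into the world frame, shape (8, 3)
```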
Average 3D IOU % over all objects for the 3D bounding box prediction task. For each setting, 3 models were trained with different random seeds and the max is reported. Baseline indicates no RoPE or STRING.
The table shows results for RGB input only (ViT) and for RGB with depth input (ViTD), across the different RoPE and STRING variants. It is clear that including depth images gives a significant performance boost over RGB alone. Furthermore, the STRING variants outperform both the baseline (no relative positional encoding) and the RoPE variants. The images below show some example bounding boxes:
Example 3D bounding box predictions. The predictions are in blue and the ground truth is in green.