Learning the RoPEs:
Better 2D and 3D Position Encodings with STRING
Abstract
We introduce STRING: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings (RoPE), a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING preserves exact translation invariance, even for token coordinates of arbitrary dimensionality, while maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Transformer-based vision encoders with RGB(-D) inputs (color plus optional depth), demonstrating substantial downstream gains. These include improvements to open-vocabulary object detection models and diffusion-based robotic controllers. We complement our extensive experiments, both robot-based and more general, with mathematical analysis.
STRING
Formally, STRING rotation matrices are defined as matrix exponentials of linear combinations of skew-symmetric generator matrices (see: bottom-left block). Importantly, these generators commute and are learnable. With this definition, it is easy to see why RoPE is a special case of STRING: for RoPE, the generators are block-diagonal matrices with 2x2 skew-symmetric blocks. In STRING, the matrices no longer have this rigid structure. In particular, even if a block-diagonal form is used (which is no longer required), the blocks do not need to be of size 2x2; in the extreme case, a single block may be as large as the entire matrix, and combinations of blocks of varying sizes are also possible.
We found two variants of STRING to work particularly well, notably outperforming RoPE. We refer to them as: (1) Cayley-STRING and (2) Circulant-STRING. Cayley-STRING is obtained by left- and right-multiplying RoPE matrices with a learnable orthogonal matrix P and its transpose, respectively. The name follows from the Cayley Transform used to parameterize P. Interestingly, since we ultimately care about the action of the STRING matrix on a given vector (while computing PEs), we can avoid explicit matrix-inverse computations and instead apply Cayley-STRING by leveraging fast linear-system solvers. Circulant-STRING matrices can be rewritten compactly as exponentiated skew-symmetric matrices S, where S is defined as the difference between a circulant matrix C (see: bottom-right) and its transpose. The first row of C is given by the dot products of the token's coordinate vector with a sequence of learnable vectors. Efficient computation of PEs with Circulant-STRING is guaranteed via the Fast Fourier Transform (FFT). Interestingly, both Cayley-STRING and Circulant-STRING can be written in the form used in the definition below, so they are indeed STRINGs.
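To make these constructions concrete, below is a minimal NumPy sketch of both variants. It is illustrative only: the function names, shapes, and the exact parameterization of the learnable quantities (the skew-symmetric matrix A behind the Cayley transform and the vectors defining the circulant's first row) are our assumptions, not the reference implementation.

```python
import numpy as np

def rope_matrix(x, thetas):
    # Classic RoPE: block-diagonal rotation with 2x2 blocks, scalar position x,
    # frequencies thetas of shape (d/2,). Returns the d x d rotation matrix.
    d = 2 * len(thetas)
    R = np.zeros((d, d))
    for k, t in enumerate(thetas):
        c, s = np.cos(x * t), np.sin(x * t)
        R[2 * k:2 * k + 2, 2 * k:2 * k + 2] = [[c, -s], [s, c]]
    return R

def cayley_orthogonal(A):
    # Cayley transform: maps a (learnable) skew-symmetric A to an orthogonal P.
    I = np.eye(A.shape[0])
    return np.linalg.solve(I + A, I - A)  # P = (I + A)^{-1} (I - A)

def cayley_string_apply(v, x, thetas, A):
    # Cayley-STRING: conjugate the RoPE rotation by P, i.e. apply P R(x) P^T to v.
    # P is formed explicitly here for clarity; in practice one can solve the
    # corresponding linear systems directly and never materialize the inverse.
    P = cayley_orthogonal(A)
    return P @ rope_matrix(x, thetas) @ P.T @ v

def circulant_string_apply(v, coords, learnable_vecs):
    # Circulant-STRING: S = C - C^T with C circulant; the first row of C holds
    # the dot products of the token's coordinate vector with learnable vectors.
    # exp(S) v is applied in O(d log d) via the FFT, since circulant matrices
    # are diagonalized by the DFT.
    r = learnable_vecs @ coords          # first row of C, shape (d,)
    c_col = np.roll(r[::-1], 1)          # first column of C
    s_col = c_col - r                    # first column of S = C - C^T
    eig = np.exp(np.fft.fft(s_col))      # eigenvalues of exp(S)
    return np.real(np.fft.ifft(eig * np.fft.fft(v)))

# Example usage (hypothetical shapes): d = 8, a scalar position for the Cayley
# variant and a 3D token coordinate for the circulant variant.
rng = np.random.default_rng(0)
d = 8
A = rng.normal(size=(d, d)); A = A - A.T          # learnable skew-symmetric
v = rng.normal(size=d)
out_cayley = cayley_string_apply(v, 3.0, rng.uniform(size=d // 2), A)
out_circulant = circulant_string_apply(v, np.array([1.0, 2.0, 0.5]),
                                       rng.normal(size=(d, 3)))
```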
Next, we show STRINGs in action for several robotics applications. In all of them, STRINGs are applied inside the vision encoders of the corresponding controllers' backbones.
ROBOTICS
ALOHA Unleashed in Simulation
We evaluate the performance of STRING on dexterous robotics tasks using ALOHA 2, a bimanual parallel-jaw gripper workcell with two 6-DoF arms, within the ALOHA Unleashed simulation environment. We trained ALOHA Unleashed's Transformer-based neural network with diffusion policies (conditioned on vision encoders) on human demonstrations of 12 dexterous tasks. The vision system used images from RGB cameras positioned overhead, on each wrist, and at table level.
The evaluations from the best-performing checkpoints for each task, using STRING with a SigLIP vision transformer, are presented below. Note that a green tick indicates a successful run and a red cross indicates a failed run.
Additionally, the best-performing simulation evaluations for the tasks with the largest improvement and the only regression relative to the baseline SigLIP model are presented below.
ALOHA Real World Evals
Furthermore, we demonstrate STRING's capability for dexterous manipulation on real-world ALOHA robots. Qualitatively, we observed comparable performance in pick-and-place style tasks. STRING also demonstrated an advantage in tasks where precise grasping is necessary, e.g. folding a dress.
Fold dress
Put the bowl into the drying rack
Kuka-IIWA Real-World Evals
We evaluate the performance of STRING on robotics tasks using a bimanual KUKA-IIWA setup. We trained policies using an energy-based parameterization of the diffusion model. The trained policies use an overhead camera as well as wrist cameras; all cameras provide RGB images, while depth images are used only for the overhead camera. For more details, please refer to the paper.
Out-of-Distribution Experiments
We also show qualitative results for our out-of-distribution evaluations. In the figures below, the top row shows failure scenarios for the 2D policies, while the bottom row shows successful evaluations of the STRING-based 3D policy.
Table height experiments: Finally, we also show qualitative results for the table-height out-of-distribution evaluation, where the STRING-based policy performs 4x better than the 2D baseline.
General Experiments
3D Detection
We evaluate the performance of STRING on the 3D detection task. Given an input image and a list of text strings describing objects in the scene, the model must output a 3D bounding box for each object. We base our implementation on OWL-ViT, which we modify to output the full SE(3) transform and size of each bounding box. We generated a synthetic dataset of 4 million images, 80 of which were held out for testing and the remainder used to train our model.
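As an illustration of this output format, here is a minimal NumPy sketch that converts a predicted pose and size into the eight corners of an oriented 3D box. The quaternion parameterization of the rotation and the function names are our assumptions for the sketch; the actual detection head may parameterize SE(3) differently.

```python
import numpy as np

def quat_to_rot(q):
    # Unit quaternion (w, x, y, z) -> 3x3 rotation matrix.
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def box_corners(translation, quaternion, size):
    # Turn a predicted SE(3) pose (translation + rotation) and box size into
    # the 8 corners of the oriented 3D bounding box, in world coordinates.
    R = quat_to_rot(np.asarray(quaternion, dtype=float))
    half = np.asarray(size, dtype=float) / 2.0
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                                   for sy in (-1, 1)
                                   for sz in (-1, 1)])
    return (signs * half) @ R.T + np.asarray(translation, dtype=float)

# Hypothetical prediction for one query: a 10 cm cube, 0.5 m in front of the camera.
corners = box_corners([0.0, 0.0, 0.5], [1.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1])
```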
Below is a table showing the average intersection-over-union of the predicted bounding boxes with the ground-truth boxes on the test set:
The table shows results for RGB-only input (ViT) and RGB with depth input (ViTD), as well as for different RoPE and STRING variants. From this it is clear that including depth images gives a significant performance boost over RGB alone. Furthermore, the STRING variants outperform both the baseline (no relative positional encoding) and the RoPE variants. The images below show some example bounding boxes:
Example 3D bounding box predictions. The predictions are in blue and the ground truth is in green.
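For reference, the intersection-over-union metric reported in the table above can be sketched for the simplified axis-aligned case; the oriented boxes predicted by the model additionally require intersecting rotated boxes (a convex-polytope intersection), which we omit here. The function name and interface are our assumptions for the sketch.

```python
import numpy as np

def iou_3d_axis_aligned(min_a, max_a, min_b, max_b):
    # Intersection-over-union of two axis-aligned 3D boxes, each given by its
    # min and max corners. A simplified stand-in for the oriented-box metric.
    inter_min = np.maximum(min_a, min_b)
    inter_max = np.minimum(max_a, max_b)
    inter = np.prod(np.clip(inter_max - inter_min, 0.0, None))
    vol_a = np.prod(np.asarray(max_a) - np.asarray(min_a))
    vol_b = np.prod(np.asarray(max_b) - np.asarray(min_b))
    return inter / (vol_a + vol_b - inter)
```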