Evaluating Robustness of Visual Representations for Object Assembly Task Requiring Spatio-Geometrical Reasoning

Chahyon Ku1, Carl Winge1, Ryan Diaz1, Wentao Yuan2, Karthik Desingh1

University of Minnesota1, University of Washington2

To be presented at ICRA 2024

Presented at CoRL 2023 Pretraining for Robotics Workshop

Abstract

This paper primarily focuses on evaluating and benchmarking the robustness of visual representations in the context of object assembly tasks. Specifically, it investigates the alignment and insertion of objects with geometrical extrusions and intrusions, commonly referred to as a peg-in-hole task. The accuracy required to detect and orient the peg and the hole geometry in SE(3) space for successful assembly poses significant challenges. To address this, we employ a general visuomotor policy learning framework that uses visual pretraining models as vision encoders. Our study investigates the robustness of this framework when applied to a dual-arm manipulation setup, specifically with respect to grasp variations. Our quantitative analysis shows that existing pretrained models fail to capture the essential visual features necessary for this task, whereas a visual encoder trained from scratch consistently outperforms the frozen pretrained models. Moreover, we discuss rotation representations and associated loss functions that substantially improve policy learning. We present a novel task scenario designed to evaluate progress in visuomotor policy learning, with a specific focus on improving the robustness of intricate assembly tasks that require both geometrical and spatial reasoning.

Supplementary Video

Frequently Asked Questions

Q: What are the main contributions of this work?


Q: How would this work benefit the community?

We believe that the robot manipulation learning community would benefit from the following insights from our comprehensive evaluation of the high-precision insertion task:


Q: Why focus on such a specific task?

To equip robots for long-horizon (furniture-level) assembly tasks, it is essential to enable them to perform spatial-geometrical reasoning while being robust to variations introduced by grasping the parts. With this aim, we propose a single task with varying geometries and grasp variations. We believe a robust solution on these carefully constructed objects will provide insights and tools for developing a general solution that moves towards longer-horizon assembly. We would like to emphasize that the proposed task remains challenging for learning-based methods, as is evident from our evaluation.

Moreover, to our knowledge, there are no comparable benchmarks for object assembly tasks with an SE(3) action space. Recent benchmarks for data-driven assembly, such as Form2Fit and Transporter, use top-down images to predict SE(2) + z (x, y, z, theta) top-down suction grasp locations, formulating the problem as a variant of pick-and-place. In SE(3) benchmarks such as Robomimic, the robot is tasked to generalize over spatial variations (varying location and orientation of objects), but not geometric variations (a single fixed set of objects is used).


Q: How does it compare with other imitation learning methods?

We observed that improvements in imitation learning that exploit temporal consistency and relations (e.g., BC-RNN) do not offer much benefit for our task, as our task is generally fewer than 40 steps long. This is consistent with results from Robomimic [1], where only the three longer tasks (Square with 151 steps, Transport with 469 steps, and Tool Hang with 480 steps) showed a visible difference in performance compared to the basic BC-MLP setup. However, we hypothesize that methods capable of modelling multimodality through a different training objective (e.g., Diffusion Policy) may offer significant improvements in success rate, and we leave this for future work.

Supplementary Material

A. Frozen vs. Finetuning

Details of the Experiment: For a variety of pretrained models, we train policies with the image encoder unfrozen (finetuned) and compare their success rates over 40 rollouts against the frozen counterparts. The only difference from the frozen variants reported in the paper is whether the image encoder weights are updated during policy training. These models are trained on 100/1000 demonstrations, which include all shapes, and 1 view (top camera).
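
As a minimal sketch of this setup (assuming a PyTorch/torchvision-style pipeline; the encoder choice, embedding dimension, and the helper name build_policy are illustrative, not the paper's code), the frozen and finetuned variants differ only in whether the encoder's parameters receive gradients. Multi-view inputs and proprioception, which enlarge the decoder's input dimension, are omitted here for brevity.

import torch
import torch.nn as nn
from torchvision.models import resnet50

def build_policy(finetune_encoder: bool, action_dim: int = 18):
    """Illustrative visuomotor policy: pretrained image encoder + MLP action decoder."""
    encoder = resnet50(weights="IMAGENET1K_V2")    # stand-in for any pretrained encoder
    encoder.fc = nn.Identity()                     # expose the pooled 2048-d embedding
    for p in encoder.parameters():
        p.requires_grad = finetune_encoder         # False = frozen, True = finetuned
    if not finetune_encoder:
        encoder.eval()                             # keep BatchNorm statistics fixed when frozen

    decoder = nn.Sequential(                       # MLP policy head of sizes [1024, 1024, 18]
        nn.Linear(2048, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, action_dim),
    )
    # Only trainable parameters are handed to the optimizer, so the frozen
    # variant updates the action decoder alone.
    trainable = [p for p in list(encoder.parameters()) + list(decoder.parameters())
                 if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-3)
    return encoder, decoder, optimizer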

Observation from the Experiment: Unfreezing the perception module during training has drastically different effects across models. Models such as R3M ResNet-50 and ImageNet ResNet-50, which performed relatively poorly when frozen, improve greatly with fine-tuned perception modules. The most dramatic improvement is in the ZTZR performance of the R3M ResNet-50 trained on 1000 episodes, which jumps from 0.225 frozen to 0.9 unfrozen. The unfrozen ImageNet ResNet-50 matches or surpasses the best-performing frozen models, such as CLIP ResNet-50, in every case. On the other hand, the success rates of both CLIP models collapse to 0 in all cases when unfrozen. We hypothesize that the attention-pooling mechanism, which provides very strong representations during frozen inference, makes out-of-distribution fine-tuning unstable with our hyperparameters. Even with these improvements, the pretrained models still fall short of the non-pretrained ResNet-18 and ResNet-50 in most cases, with ResNet-18 remaining the best-performing architecture. This further reinforces our earlier observation that ResNet-18 outperforms ResNet-50: even though ResNet-50 has been used extensively for computer vision tasks, more compact or condensed representations may be better suited for policy learning.


B. Data Efficiency of Models

Details of the Experiment: Full table of success rates with 3 views (top + 2 wrist views) and 100/1000/10000 training demonstrations of all shapes. All models are trained for 50000 steps with a learning rate of 1e-3 using the Adam optimizer.

Observation from the Experiment: For non-pretrained models, the average performance increases drastically as the amount of data increases. Specifically, increasing from 100 to 1000 demonstration episodes improves the average success rate by 0.27, from 0.38 to 0.65 (see the bottom two rows). The improvement is consistent, although less drastic, when increasing from 1000 to 10000 episodes, with the average success rate increasing by 0.05, from 0.65 to 0.70. As the vision encoders are trained end-to-end, they become better at extracting the spatial information relevant to the task, improving success rates. In contrast, when pretrained vision encoders are frozen, those model variations rely solely on the MLP policy head to learn the trajectory. We hypothesize that the variations with frozen pretrained vision encoders do not improve with more data for two reasons: the MLP policy head saturates with just 100 examples, and the features from models trained on out-of-distribution data lack the geometric-spatial information necessary to complete the task.


C. With vs. Without Proprioception

Details of the Experiment: For Non-pretrained ResNet-18, CLIP ResNet-50, and R3M ResNet-50, we train new models without proprioception and compare the success rates over 40 rollouts. The only difference is in the input dimension of the action decoder: with_prop has the current end effector pose concatenated to the image embeddings, while no_prop has just the image embeddings. Note that all other experiments from the paper are done with proprioception. These models are trained on 1000 demonstrations, which include all shapes, and 3 views (top + 2 wrist cameras).
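
A minimal sketch of this difference (tensor names and the helper decoder_input are illustrative), assuming the image embedding and the current end-effector pose are already available as flat tensors:

import torch

def decoder_input(image_embedding: torch.Tensor,
                  ee_pose: torch.Tensor,
                  with_prop: bool) -> torch.Tensor:
    """with_prop concatenates the current end-effector pose to the image
    embedding, enlarging the action decoder's input dimension; no_prop
    feeds the image embedding alone."""
    if with_prop:
        return torch.cat([image_embedding, ee_pose], dim=-1)   # with_prop
    return image_embedding                                      # no_prop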

Observation from the Experiment: We observe that (1) proprioception is crucial for learning task variations with Z rotation and (2) proprioception reduces variance in models with lower-performing visual encoders (R3M ResNet-50). We hypothesize that the policy network benefits from explicitly knowing the robot's current state to accurately step toward the task goal. While in theory the model could infer the current state from images, we observe that providing it explicitly yields better results. The benefits of this explicit signal are clearest when there are other perturbations along with Z rotation (XTZR, ZTZR, YRZR, XZTYZR).


D. Colored vs. Original Objects in Simulation

Details of the Experiment: We compare models trained on colored objects, which have more contrast between the base and the extrusion (shown in the figures below), against models trained on the original objects. We train and evaluate Non-pretrained ResNet-18 and CLIP ResNet-50 using 1000 demonstrations of each task variation.

Observation from the Experiment: To our surprise, we observe that the performance of models trained with colored objects is comparable to, but not better than, that of models trained with the original objects. For example, the model trained on colored ZR performs better than its original counterpart, but the model trained on colored XTZR performs worse. There is no clear trend in which tasks the models trained on colored objects perform well on.

Colored Objects

Original Objects

E. Object Models

Every object model shares the same 8cm × 8cm × 8cm cubical base. All extrusions are standardized to a height of 2cm and all intrusions to a depth of 2.5cm, while the tolerance of the shapes parallel to the block face varies from 1 to 4mm. For a more detailed look at the peg and hole models used in the task, Fig. 1 shows an example of a plus-shaped peg and hole pair with the relevant measurements of the intrusion and extrusion. In general, the intrusions on the “hole” models are slightly larger than the extrusions on the “peg” models to allow for some tolerance when fitting the two objects together. The figure also shows an example of 3D-printed peg and hole models in the real world, which are at exactly the same scale as the objects used in simulation.


F. Training Hyperparameters

All models use an MLP policy head of sizes [1024, 1024, 18]. We use the original AveragePool and CLS-token outputs to downsample the spatial dimensions of the ResNet and ViT models, with the exception of CLIP ResNet-50 [1], which uses AttentionPool in its original implementation. All models were trained with the Adam optimizer [2], a learning rate of 0.001, and a batch size of 16, for 50000 steps. All models were evaluated on 40 randomizations generated from the same seeds, ensuring that models at different points in training were evaluated on the same set of unseen randomizations.
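
A sketch of this training configuration as a minimal behavior-cloning loop. The data loader, the observation encoding inside policy, and the plain L2 loss are placeholders (the paper discusses rotation-specific representations and losses); the optimizer, learning rate, step count, and batch size follow the text above.

import torch
import torch.nn as nn

def train(policy: nn.Module, loader, steps: int = 50_000, lr: float = 1e-3):
    """Minimal behavior-cloning loop matching the hyperparameters above:
    Adam, learning rate 0.001, 50000 gradient steps (loader yields batches of 16)."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    criterion = nn.MSELoss()            # placeholder loss on the 18-d action target
    batches = iter(loader)
    for _ in range(steps):
        try:
            obs, action = next(batches)
        except StopIteration:           # cycle through the demonstration data
            batches = iter(loader)
            obs, action = next(batches)
        optimizer.zero_grad()
        loss = criterion(policy(obs), action)
        loss.backward()
        optimizer.step()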


G. Comparison on Specific Object Sets

We compare performance on different object sets by training models on Order-all and evaluating them separately on 40 random initializations of Order-1, Order-2, and Order-4 (Figure 1), as well as 45deg-Rotated (Figure 2). We observe that Order-1 objects are the hardest to learn, as expected, since they have the largest number of possible rotations. To our surprise, however, Order-4 objects, which have one possible rotation and are intuitively the easiest to learn a policy on, do not perform best for most task variations (Chart 1). Upon closer inspection, we find that policies are learned from the easiest to the hardest object sets (Chart 2). Initially, the model learns to follow the general trajectory while ignoring any variation in the Z rotation, achieving perfect accuracy on Order-4. Then the second Z rotation, 90° apart from the default trajectory, is learned, improving results on Order-2. While this happens, performance on Order-4 decreases, indicating a tradeoff between performance on different object sets. Similarly, as the model learns geometric reasoning on the most challenging Order-1 objects, performance on Order-4 and Order-2 degrades further. We hypothesize that this is due to (1) the multimodality of the data, where the possible trajectory variations differ drastically depending on the object type, and (2) the limited capacity of the vision encoder and the policy head.
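
To make the symmetry argument concrete, the small illustrative helper below (our own naming, not the paper's evaluation code) counts the distinguishable goal rotations per object set, under the assumption that Order-N denotes N-fold rotational symmetry of the cross-section about Z and that task rotations are sampled at 90° increments.

import numpy as np

def distinct_goal_rotations(symmetry_order: int, step_deg: float = 90.0):
    """Goal Z-rotations that remain distinguishable for an object whose
    cross-section has `symmetry_order`-fold rotational symmetry, when task
    rotations are sampled at multiples of `step_deg` (illustrative only)."""
    period = 360.0 / symmetry_order                 # rotations equivalent under symmetry
    candidates = np.arange(0.0, 360.0, step_deg)
    return sorted({round(float(c) % period, 6) for c in candidates})

for order in (1, 2, 4):
    print(f"Order-{order}:", distinct_goal_rotations(order))
# Order-1: [0.0, 90.0, 180.0, 270.0]  -> hardest, most rotations to resolve
# Order-2: [0.0, 90.0]                -> two goals, 90 degrees apart
# Order-4: [0.0]                      -> effectively one goal rotation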

References

[1] Mandlekar, Ajay, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. "What Matters in Learning from Offline Human Demonstrations for Robot Manipulation." In Conference on Robot Learning, pp. 1678-1690. PMLR, 2022.