ImageManip: Image-based Robotic Manipulation with Affordance-Guided Next View Selection

Supplementary Materials

1. Real-World Experiment

Real-World Demonstration:

icra.mp4

We provide a video demonstration of the manipulation process. The video shows that our approach handles specular and transparent objects successfully. Moreover, even when the depth estimation is not precise enough to enable direct contact, our method uses sensor-based contact detection to complete the manipulation. This allows successful object manipulation despite the limited accuracy of the estimated depth.
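
As an illustration of this contact-detection fallback, a minimal sketch is given below. The force threshold, step logic, and the read_force/step_fn callables are assumptions introduced for this example, not the exact interface of our real-world system.

```python
import numpy as np

def approach_until_contact(read_force, step_fn, max_steps=200,
                           force_threshold=2.0):
    """Move toward the predicted contact point until contact is sensed.

    read_force: callable returning the current end-effector force reading
    (a 3-vector, in newtons); step_fn: callable advancing the gripper one
    small step along the approach direction. Both callables, the step
    count, and the 2 N threshold are assumptions for this sketch.
    """
    for _ in range(max_steps):
        force = np.asarray(read_force())
        if np.linalg.norm(force) > force_threshold:
            return True   # contact detected; switch to the manipulation phase
        step_fn()
    return False          # no contact within the allowed travel

# Dummy demo: the sensed force ramps up as the gripper advances.
readings = iter(np.linspace(0.0, 5.0, 50))
print(approach_until_contact(lambda: [next(readings), 0.0, 0.0], lambda: None))
```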

Comparisons with Point Clouds:

To substantiate our findings, we provide additional qualitative evidence in a real-world setting. All experimental details match those in the main paper; the goal is to illustrate the limitations of point clouds. As illustrated in [xxx], point clouds exhibit notable limitations when confronted with specular or transparent objects such as the pan and the kettle. The faucet, for example, appears as only sparse points owing to its material, which complicates both perception and manipulation. The pan is a further illustration: although only its lid is transparent and the handle is not, the transparent component still corrupts the overall imaging, intensifying the difficulties in perception and manipulation. In contrast, cost-effective RGB images provide dense pixel-wise information even for specular and transparent objects. Our motivation in this paper is therefore to offer a potential solution for an RGB-only manipulation framework, which has its own strengths in industrial applications. Although this modality lacks explicit geometric detail, we compensate by predicting depth to recover the object's geometric structure.

2. Token-Wise Correspondence

We visualize a comparison of pixel-wise and token-wise correspondence. Pixel-wise correspondence is obtained from the initial depth prediction, which may be inaccurate and lead to error accumulation. The top part of the following figure shows the resulting misalignment in pixel-wise correspondence. To address this, we introduce token-wise correspondence, which matches tokens in the high-dimensional feature space, where each token aggregates the features of a 64x64 pixel region. It is thus more robust to depth inaccuracy, since it does not require accurate pixel-level alignment. We visualize the token-wise correspondence at the bottom of the figure and fuse only correspondent tokens. By doing so, we ensure that the fused tokens share the same semantic representation and thereby alleviate the impact of inaccurate depth prediction.
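
To make the idea concrete, the following is a minimal sketch of token-wise correspondence and fusion. The cosine-similarity matching, the averaging-based fusion, and the token and feature dimensions are assumptions chosen for illustration rather than the exact operations in our network.

```python
import torch
import torch.nn.functional as F

def token_correspondence_fusion(tokens_a: torch.Tensor,
                                tokens_b: torch.Tensor) -> torch.Tensor:
    """Illustrative token-wise correspondence and fusion.

    tokens_a, tokens_b: (N, C) feature tokens from two views, where each
    token aggregates a 64x64 pixel region (shapes here are assumptions).
    Returns fused tokens of shape (N, C).
    """
    # L2-normalize so the dot product becomes cosine similarity.
    a = F.normalize(tokens_a, dim=-1)          # (N, C)
    b = F.normalize(tokens_b, dim=-1)          # (N, C)

    # Pairwise cosine similarity between tokens of the two views.
    sim = a @ b.t()                            # (N, N)

    # For each token in view A, pick its most similar token in view B.
    idx = sim.argmax(dim=-1)                   # (N,)
    matched_b = tokens_b[idx]                  # (N, C)

    # Fuse only correspondent tokens (simple averaging as a placeholder;
    # a learned or attention-based fusion could be used instead).
    return 0.5 * (tokens_a + matched_b)


if __name__ == "__main__":
    # Example: 256 tokens with 256-dimensional features per view.
    ta = torch.randn(256, 256)
    tb = torch.randn(256, 256)
    print(token_correspondence_fusion(ta, tb).shape)  # torch.Size([256, 256])
```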

3. More Ablation Studies on the View Selection Module

Number of Candidates:

To investigate the impact of the number of candidates, we divide the potential space into 4/9/16 candidate regions within a predefined 3D space spanning a distance range of 2.5 to 4.5 units and a fixed azimuth angle range. On 10 seen categories under the pushing primitive, 4/9/16 candidates yield average accuracies of 0.72/0.73/0.73. A larger number of candidate regions does not bring a significant performance improvement, because neighboring views vary only minimally across the potential space; increasing the number of candidates therefore does not proportionally increase the information contributed by the additional views.
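
For reference, a minimal sketch of such a grid partition is shown below. The azimuth bounds and the square grid layout (2x2, 3x3, 4x4) are assumptions used only to illustrate how 4/9/16 candidate regions can be generated from the distance and azimuth ranges.

```python
import numpy as np

def make_candidate_regions(n_dist: int, n_azim: int,
                           dist_range=(2.5, 4.5),
                           azim_range_deg=(-15.0, 15.0)):
    """Partition the potential camera space into n_dist * n_azim regions.

    Each region is returned as ((d_lo, d_hi), (a_lo, a_hi)).
    The azimuth bounds here are illustrative assumptions.
    """
    d_edges = np.linspace(*dist_range, n_dist + 1)
    a_edges = np.linspace(*azim_range_deg, n_azim + 1)
    regions = []
    for i in range(n_dist):
        for j in range(n_azim):
            regions.append(((d_edges[i], d_edges[i + 1]),
                            (a_edges[j], a_edges[j + 1])))
    return regions


# 4, 9, and 16 candidates correspond to 2x2, 3x3, and 4x4 grids.
for k in (2, 3, 4):
    print(k * k, "candidates:", len(make_candidate_regions(k, k)))
```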

Candidate Scope:

We analyze the impact of the size of the potential scope by expanding the predefined distance range to 1.0 to 5.0 units and the azimuth angle range to 40 degrees, while still subdividing the potential space into nine candidate regions for camera placement. Compared with the scope described in the main paper, this setting yields a slight performance decrease, from 0.73 to 0.71. Enlarging the predefined space widens each candidate region and thus introduces more ambiguity when positioning the camera within the broader candidate area. We also evaluate random next-view selection within this enlarged scope, which causes a larger drop, to 0.50. As shown in the following figure, views in this broader 3D space differ substantially from one another, so random placement cannot guarantee that views valuable for manipulation prediction are captured. This shows the importance of our next-view selection strategy, especially over a large scope. Conversely, when we shrink the potential space (to 2.5 units and 10 degrees) and again generate nine candidate regions, the performance drops to 0.66. A smaller potential scope may not contain sufficiently informative views to complement the global perspective, since the candidate views are too similar to it and the optimal view can easily lie outside the scope.

In the figure above, we visualize next views captured beyond the specified range: (a) and (b) exceed the allowed distance, while (c)-(f) have azimuth or altitude angles outside the valid range. These views add little valuable information to the initial global view. By setting the capture range for the next view based on experimental experience and observation, we ensure that subsequent views contribute essential information, enhancing the learning process and enabling a more comprehensive understanding of the object and its articulated parts.
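
The sketch below illustrates why widening a candidate region increases placement ambiguity: a camera position is sampled uniformly within a region's distance and azimuth bounds, so a broader region yields more scattered poses. The fixed altitude angle and the spherical-to-Cartesian convention are assumptions for this example.

```python
import numpy as np

def sample_pose_in_region(region, rng=np.random.default_rng()):
    """Sample a camera position within one candidate region.

    region: ((d_lo, d_hi), (a_lo, a_hi)) with distance in units and
    azimuth in degrees; the fixed altitude angle used here is an
    assumption. Returns an (x, y, z) position relative to the object.
    """
    (d_lo, d_hi), (a_lo, a_hi) = region
    d = rng.uniform(d_lo, d_hi)
    azim = np.deg2rad(rng.uniform(a_lo, a_hi))
    alt = np.deg2rad(30.0)  # assumed fixed altitude angle
    x = d * np.cos(alt) * np.cos(azim)
    y = d * np.cos(alt) * np.sin(azim)
    z = d * np.sin(alt)
    return np.array([x, y, z])


# A wider region (second call) produces more scattered camera placements.
narrow = ((2.5, 3.2), (-10.0, 10.0))
wide = ((1.0, 5.0), (-20.0, 20.0))
print(sample_pose_in_region(narrow), sample_pose_in_region(wide))
```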

Upper Bound:

For the setting with nine camera pose candidates, we establish an upper bound for the view selection module: we place the camera in every candidate region, generate the refined affordance, and manipulate the object, counting a manipulation as successful if any of the nine trials succeeds. This oracle reaches an accuracy of 0.74, which can be regarded as the upper limit. In Section IV.C, 'with random next view', we also provide a lower limit for this module by randomly placing the camera among the nine candidates to capture the next view, which achieves 0.70 accuracy. Compared with these upper and lower bounds, our view selection module, with an accuracy of 0.73, demonstrates its effectiveness in selecting informative views.
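
As a concrete reading of this oracle evaluation, the sketch below computes the upper-bound accuracy from per-candidate trial outcomes. The array layout and the random example data are assumptions for illustration only.

```python
import numpy as np

def upper_bound_success(trial_outcomes: np.ndarray) -> float:
    """Oracle upper bound: an object counts as a success if any of the
    nine candidate views leads to a successful manipulation.

    trial_outcomes: boolean array of shape (num_objects, 9), one entry
    per candidate view (the layout is an assumption for this sketch).
    """
    per_object = trial_outcomes.any(axis=1)
    return float(per_object.mean())


# Example with random outcomes for 100 objects.
rng = np.random.default_rng(0)
print(upper_bound_success(rng.random((100, 9)) < 0.3))
```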

Visualization of the View Selection Module:

We show the improvement of the refined affordance prediction over the initial affordance prediction. By incorporating valuable visual information from the selected next view, the refined affordance maps exhibit improved precision and clarity, particularly in boundary regions. Compared with the initial affordance map, the refined map represents the object's affordance more accurately.

We further show the refined affordance predictions produced for each of the nine candidates. Different candidates generate different refined affordance predictions, demonstrating the necessity of the view selection module. Different next views capture distinct object details, leading to variations in the refined affordance maps after fusion. This observation further verifies the significance of the view selection module in our approach.

4. Domain Randomization

The randomization of object material and lighting is shown in the figure above. By doing so, we aim to ease sim-to-real transfer, since the real world contains diverse appearance and lighting conditions.
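
A minimal sketch of how such randomization could be sampled per training episode is given below. The parameter names, value ranges, and the idea of mapping them onto the simulator's material and light settings are assumptions for illustration, not our exact configuration.

```python
import numpy as np

def sample_randomization(rng=np.random.default_rng()):
    """Sample one set of domain-randomization parameters.

    The parameter names and ranges below are illustrative assumptions;
    they would be mapped onto the simulator's material and light APIs.
    """
    material = {
        "base_color": rng.uniform(0.0, 1.0, size=3),   # RGB albedo
        "roughness": rng.uniform(0.1, 0.9),
        "metallic":  rng.uniform(0.0, 1.0),
        "specular":  rng.uniform(0.0, 1.0),
    }
    lighting = {
        "intensity": rng.uniform(0.5, 3.0),
        "color":     rng.uniform(0.8, 1.0, size=3),    # near-white light
        "direction": rng.normal(size=3),               # random light direction
    }
    lighting["direction"] /= np.linalg.norm(lighting["direction"])
    return material, lighting


material, lighting = sample_randomization()
print(material["roughness"], lighting["intensity"])
```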