This page presents results that are not included in the paper, namely two ablation studies:
G-CompACT with and without image feedback
G-CompACT trained with and without noise added to the reference frame
Our middle-layer visuomotor policy, G-CompACT, is motivated by the limited accuracy of the high-level planner, Diff-EDF. Here, we briefly report the root-mean-square error (RMSE) of Diff-EDF on the training dataset.
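As a reminder of how an RMSE of this kind can be computed, the following is a minimal sketch, assuming the error is measured between predicted and ground-truth positions. The function name `rmse` and the sample arrays are hypothetical and are not data from the paper.

```python
import numpy as np

def rmse(predicted, target):
    """Root-mean-square error between two arrays of shape (N, D).

    The error of each sample is its Euclidean norm over the D components.
    """
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    per_sample_sq_error = np.sum((predicted - target) ** 2, axis=-1)
    return float(np.sqrt(np.mean(per_sample_sq_error)))

# Hypothetical positions in metres, for illustration only (not from the paper).
predicted_positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
target_positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.02]])
position_rmse = rmse(predicted_positions, target_positions)
```

The same function could be applied to orientation errors if they are first expressed as vectors, for example rotation vectors in radians; that choice is an assumption here.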
We test G-CompACT with and without image feedback. Noise weighted by the RMSE values from Table 1 is added to the reference frame, and the same noisy reference frame is fed to both variants (with and without image feedback) to ensure a fair comparison.
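The noise injection described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the noise is assumed to be zero-mean Gaussian with a per-axis standard deviation equal to `scale` times the RMSE, the reference frame is assumed to be a position plus a 3x3 rotation matrix, and the RMSE constants are placeholders rather than the values in Table 1. All function and variable names are hypothetical.

```python
import numpy as np

def rotation_from_rotvec(rotvec):
    """Rodrigues' formula: rotation vector (axis * angle, rad) -> 3x3 matrix."""
    angle = np.linalg.norm(rotvec)
    if angle < 1e-12:
        return np.eye(3)
    k = rotvec / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def perturb_reference_frame(position, rotation, pos_rmse, rot_rmse, rng, scale=1.0):
    """Return a noisy copy of a reference frame (position, 3x3 rotation).

    Assumption: zero-mean Gaussian noise with standard deviation
    scale * RMSE is added to the position and applied to the orientation
    as a small rotation.
    """
    noisy_position = position + rng.normal(0.0, scale * pos_rmse, size=3)
    rotvec_noise = rng.normal(0.0, scale * rot_rmse, size=3)
    noisy_rotation = rotation_from_rotvec(rotvec_noise) @ rotation
    return noisy_position, noisy_rotation

# Placeholder RMSE values (assumed); the real numbers are those in Table 1.
POS_RMSE_M = 0.005
ROT_RMSE_RAD = 0.05

rng = np.random.default_rng(seed=0)
true_position, true_rotation = np.zeros(3), np.eye(3)
noisy_frame = perturb_reference_frame(
    true_position, true_rotation, POS_RMSE_M, ROT_RMSE_RAD, rng
)

# Sample the noise once, then hand the identical frame to both variants.
frame_for_policy_with_images = noisy_frame
frame_for_policy_without_images = noisy_frame
```

Sampling once and reusing the result is what keeps the comparison fair: both variants start from exactly the same pose error, so any difference in success rate comes from the image feedback alone.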
The policy without image feedback is comparable to a human grasping a peg and trying to insert it into a hole with eyes closed. The intuitive strategy in this case is to first find the surface where the hole is located and then perform a random or spiral search. Accordingly, G-CompACT without image feedback behaves like a near-random exploration around the contact surface, which results in a low success rate.
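To make the blind-search analogy concrete, here is a minimal sketch of a spiral search pattern on the contact surface. It illustrates the generic strategy mentioned above, not the behavior learned by G-CompACT; the function name, the Archimedean spiral, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def spiral_search_offsets(pitch, max_radius, points_per_turn=12):
    """Planar (x, y) offsets along an Archimedean spiral.

    The radius grows by `pitch` per full turn: r = pitch * theta / (2 * pi),
    starting at the contact point and ending at `max_radius`.
    """
    turns = max_radius / pitch
    n_points = int(np.ceil(turns * points_per_turn)) + 1
    theta = np.linspace(0.0, 2.0 * np.pi * turns, n_points)
    radius = pitch * theta / (2.0 * np.pi)
    return np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1)

# Hypothetical values: 1 mm between spiral arms, 5 mm search radius.
offsets = spiral_search_offsets(pitch=0.001, max_radius=0.005)
```

Each row of `offsets` would be added to the initial contact position while the peg is pressed against the surface. Without vision, such a search has no information about which direction reduces the pose error.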
In contrast, the policy with image feedback consistently achieved a 100% success rate, as it can continuously infer corrections toward the actual reference frame in real time. This result justifies providing image feedback to the policy.
Table 3 below compares G-CompACT trained with and without noise added to the reference frame. In this case, we injected Gaussian noise scaled to twice the RMSE value from Table V of the revised manuscript. The model trained with noise exhibited higher success rates, demonstrating improved robustness to pose estimation errors.