This section presents material that was omitted from the paper due to page limits.
Deep RL is employed in addition to behavior cloning (BC) to further demonstrate our concept. We apply the soft actor-critic (SAC) algorithm, since it is known to explore the state space more thoroughly and to be less likely to fall into a local optimum than other deep RL algorithms.
Solving the peg-in-hole (PiH) task with RL is challenging because PiH is an exploration-sensitive problem. Previous research tackling PiH with RL typically uses sparse reward signals, where a large positive reward is given only when the task is accomplished. To address this exploration issue, we leverage the off-policy nature of SAC: since the replay buffer does not need to be filled exclusively by the current policy, we sample trajectories with the scripted expert policy every 10 epochs and inject them into the buffer. The injection period of 10 is chosen to mitigate the distributional-shift issue.
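A minimal sketch of this expert-injection scheme is given below. The environment interface, the scripted-expert and SAC-agent objects, and the replay-buffer API are placeholders for illustration, not the actual implementation used in our experiments.

```python
EXPERT_INJECTION_PERIOD = 10  # inject scripted-expert rollouts every 10 epochs

def collect_rollout(env, policy, replay_buffer, max_steps=500):
    """Roll out one episode with the given policy and store its transitions."""
    obs = env.reset()
    for _ in range(max_steps):
        action = policy.act(obs)
        next_obs, reward, done, _ = env.step(action)
        replay_buffer.add(obs, action, reward, next_obs, done)
        obs = next_obs
        if done:
            break

def train(env, sac_agent, scripted_expert, replay_buffer, num_epochs=1000):
    for epoch in range(num_epochs):
        # SAC is off-policy, so the buffer may contain non-policy data;
        # periodically mixing in scripted-expert trajectories eases
        # exploration in the sparse-reward PiH setting.
        if epoch % EXPERT_INJECTION_PERIOD == 0:
            collect_rollout(env, scripted_expert, replay_buffer)
        collect_rollout(env, sac_agent, replay_buffer)
        sac_agent.update(replay_buffer)  # standard SAC gradient updates
```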
We propose a minimalist approach to solving PiH tasks with RL by reducing the dimension of the action space to 2, denoted $a_{RL} = [a_{RL,1}, a_{RL,2}]^T$. To map this reduced action back to the original action space, we define the mapping $a = [a_{RL,1}, a_{RL,1}, a_{RL,2}, 1, 1, 1]^T$. This proposal is motivated by several observations. First, the expert policy exhibits a symmetric pattern when completing the task: the rotational gains can always be set to their maximum value (which no longer holds in the real-world environment, where uncertainties on the goal position are present), and the gains in the x and y directions can be treated as symmetric (which may not hold if the shape of the peg and hole changes). By enforcing this symmetric structure and reducing the action space, the RL agent learns the policy more efficiently, as it does not need to explore unnecessary non-symmetric parts of the action space. Moreover, compared to using the full 6-DOF gains, RL agents with smaller action spaces exhibit more consistent learning dynamics.
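The mapping from the reduced action to the full gain-scheduling action can be written compactly as below; the normalized action range and the example values are assumptions for illustration.

```python
import numpy as np

def expand_action(a_rl):
    """Map the reduced 2-D RL action a_RL = [a_RL1, a_RL2] to the full
    6-D (normalized) gain-scheduling action used by the controller."""
    # Tie the x- and y-translational gains together (a_RL1), use a_RL2 for
    # the z-translational gain, and fix the three rotational gains at their
    # maximum normalized value of 1.
    return np.array([a_rl[0], a_rl[0], a_rl[1], 1.0, 1.0, 1.0])

# Example: a 2-D action sampled by the RL agent
a_full = expand_action(np.array([0.3, 0.7]))  # -> [0.3, 0.3, 0.7, 1, 1, 1]
```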
The following reward function is considered for RL, where $\Psi(g, g_d)$ is an SE(3) distance metric incorporating the translational and rotational distances, $d_p(p, p_d)$ is the translational distance on the x-y plane, and $f_{ez}$ is the force-sensor reading in the body-frame z direction.
The model with the highest validation return is chosen for evaluation. The results are summarized in the 5th and 6th rows of the table below:
As with the BC experiments, CIC+CEV and GIC+GCEV both achieved nearly perfect success rates in the training environment. However, the benchmark approach failed to transfer the trained policy to different tasks, resulting in lower success rates than the corresponding BC results. This can be attributed to the RL policy overfitting to the default case due to its reward-maximizing nature. Overall, the RL results closely resemble those of BC, which is the main reason we decided not to include them in the paper. In addition, the minimalist approach is likely to lack generalization to other types of tasks and situations. We therefore plan to treat GIC+GCEV with RL in more detail in future versions of the paper; in particular, we aim to learn insertion primitives via RL in an SE(3)-equivariant manner.
The proposed approach can be directly extended to other types of contact-rich robotic manipulation tasks, such as surface wiping and pick-and-place. Here, we demonstrate its feasibility on a simple surface-wiping task that requires exerting a prescribed normal force on the surface. Note that this surface-wiping task can be regarded as a representative case of robotic surface polishing, plastering at construction sites, and even painting.
The end-effector is commanded to follow a circular trajectory with a radius of 0.1 m while exerting 7.5 N of normal force on one half of the circle and 15 N on the other half. We place $g_d$ at the center of the circle so that the task's progress is well described by the GCEV ($e_G$) or the CEV ($e_C$). Following the problem setup of the proposed and benchmark approaches in the PiH task, we collected the expert's gain-scheduling policy together with the GCEV and CEV in the default case, where the surface is horizontal. The trained gain-scheduling policy is then tested directly on tilted surfaces. The expected behavior is as follows: when desirable gains are applied, the resulting trajectory tracks the desired trajectory well, because high gains should be applied in the tangential directions of the surface while gains realizing the desired force are applied in the normal direction. We tested both the GIC+GCEV and CIC+CEV approaches. See the figures below for the problem setup; a sketch of the desired reference is given after the figures.
Figures: surface-wiping setup on a horizontal surface and on a 30-degree tilted surface.
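For concreteness, a minimal sketch of the desired in-plane trajectory and normal-force profile is given below, expressed in a frame attached to the (possibly tilted) surface with its origin at the circle center ($g_d$). Only the 0.1 m radius and the 7.5 N / 15 N split come from the setup above; the sampling and parameterization are assumptions.

```python
import numpy as np

RADIUS = 0.1               # m, radius of the desired circle on the surface
F_LOW, F_HIGH = 7.5, 15.0  # N, normal force on the two halves of the circle

def desired_wiping_reference(num_steps=2000):
    """Return the desired in-plane positions and normal-force profile
    along one loop of the circle."""
    theta = np.linspace(0.0, 2.0 * np.pi, num_steps, endpoint=False)
    xy = np.stack([RADIUS * np.cos(theta), RADIUS * np.sin(theta)], axis=1)
    # 7.5 N on the first half of the circle, 15 N on the second half.
    f_n = np.where(theta < np.pi, F_LOW, F_HIGH)
    return xy, f_n
```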
The simulation results for trajectory tracking are shown below. As seen in the first figure, both approaches (GIC+GCEV and CIC+CEV) show almost perfect tracking performance in the training case. However, once the surface is tilted, CIC+CEV fails to track the desired trajectory, while GIC+GCEV follows it successfully even in the highly tilted cases. Note that the original desired trajectories lie slightly inside the surface so as to exert the designated force; the desired trajectories shown are their projections onto the surface.
Figures: 3D end-effector trajectories for the horizontal, 30-degree tilted, 45-degree tilted, and -45-degree tilted cases.
The normal forces exerted on the surface are shown in the figures below, and a similar trend is observed. Both CIC+CEV and GIC+GCEV follow the force profile in the trained case, but once the surface is tilted, CIC+CEV fails severely while GIC+GCEV continues to track the profile.
Figures: normal-force profiles for the horizontal, 30-degree tilted, 45-degree tilted, and -45-degree tilted cases.