Additional Material


Baselines (Implementation Details)

We chose the size of the DMP parametrisation via a grid search over the following values: [5, 10, 20, 40, 60, 70, 80, 100, 125, 250].
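To make the swept quantity concrete, a minimal sketch of a one-dimensional discrete DMP is shown below, assuming the grid-search values above correspond to the number of forcing-term basis functions; the gains, integration step and class structure are illustrative placeholders rather than our exact implementation.

\begin{verbatim}
import numpy as np

class DMP1D:
    """Minimal one-dimensional discrete DMP (illustrative sketch).

    `n_basis` is the number of forcing-term basis functions, i.e. the
    quantity swept over in the grid search above.
    """

    def __init__(self, n_basis=20, alpha=25.0, beta=6.25, alpha_x=3.0, tau=1.0):
        self.alpha, self.beta, self.alpha_x, self.tau = alpha, beta, alpha_x, tau
        # Basis centres spaced along the phase variable, widths tied to spacing.
        self.c = np.exp(-alpha_x * np.linspace(0.0, 1.0, n_basis))
        self.h = 1.0 / (np.gradient(self.c) ** 2 + 1e-10)
        self.w = np.zeros(n_basis)  # learnable forcing-term weights

    def _forcing(self, x, g, y0):
        psi = np.exp(-self.h * (x - self.c) ** 2)
        return (psi @ self.w) / (psi.sum() + 1e-10) * x * (g - y0)

    def rollout(self, y0, g, dt=1.0 / 250.0, T=2.0):
        """Integrate the transformation system from start y0 towards goal g."""
        y, dy, x = y0, 0.0, 1.0
        traj = []
        for _ in range(int(T / dt)):
            f = self._forcing(x, g, y0)
            ddy = (self.alpha * (self.beta * (g - y) - dy) + f) / self.tau
            dy += ddy * dt
            y += dy * dt
            x += (-self.alpha_x * x / self.tau) * dt  # canonical system
            traj.append(y)
        return np.array(traj)
\end{verbatim}

Under this reading, the grid search amounts to fitting and evaluating one such model per candidate basis-function count.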

Our goal with the considered comparisons is to assess the capacity of the proposed ideas to adapt in the considered contact-rich insertion setups. We used the publicly available implementations of the eNAC, FDG and PoWER algorithms provided on the authors' web page as a reference. For eNAC we used an update step of 1e-4 and a standard deviation of 7e-2 for exploration; for FDG an exploration magnitude of 10, an update step of 0.1 and 5 basis functions; and for PoWER an initial variance of 4000. All parameters were chosen empirically via a grid search. We used an adapted version of tf-agents' implementations of SAC and PPO, as described below.
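For readability, the hyperparameters stated above can be collected into a single configuration structure (the key names are ours; the values are the ones reported in the text):

\begin{verbatim}
# Baseline hyperparameters as stated above; key names are illustrative.
BASELINE_HPARAMS = {
    "eNAC":  {"update_step": 1e-4, "exploration_std": 7e-2},
    "FDG":   {"exploration_magnitude": 10, "update_step": 0.1,
              "n_basis_functions": 5},
    "PoWER": {"initial_variance": 4000},
}
\end{verbatim}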

We found that recursive least squares (RLS) performed best among the linear policies we considered and thus report it as representative of the linear policy family. We used the publicly available implementation provided as part of padasip\footnote{\url{https://pypi.org/project/padasip/}}. The associated reference~\citep{kumar2016optimal} aims to give attribution to the idea of adapting with a recursive linear policy and does not make a general statement about the performance of the full solution. We used separate filters with scalar outputs for each of the 3 translational output dimensions of the policy. We assume state-dependent predictions, but a time-dependent policy due to the nature of the quadratic cost. Inputs were of degree 2 for each dimension; e.g. $(x^2, 2xy, 2xz)$ was used as the input for the prediction of the x axis. An action was then defined as the difference between the RLS prediction and the current state. The uniform policy assumed the same structure but sampled actions from a uniform distribution.
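A minimal sketch of this per-axis RLS setup with padasip is shown below. The forgetting factor, the initialisation, and the \texttt{target} signal (standing in for the per-axis set-point implied by the quadratic cost) are assumptions for illustration, not our exact training code.

\begin{verbatim}
import numpy as np
import padasip as pa

def quadratic_features(state, axis):
    """Degree-2 inputs for one translational axis,
    e.g. (x^2, 2xy, 2xz) for the x axis."""
    s = state[axis]
    others = [state[i] for i in range(3) if i != axis]
    return np.array([s ** 2, 2 * s * others[0], 2 * s * others[1]])

# One scalar-output RLS filter per translational dimension of the policy.
# mu (forgetting factor) and the random initialisation are illustrative choices.
filters = [pa.filters.FilterRLS(n=3, mu=0.99, w="random") for _ in range(3)]

def rls_policy_step(state, target):
    """Predict per axis, act on the difference to the current state, then adapt."""
    action = np.zeros(3)
    for axis in range(3):
        phi = quadratic_features(state, axis)
        pred = filters[axis].predict(phi)
        action[axis] = pred - state[axis]       # action = RLS prediction - current state
        filters[axis].adapt(target[axis], phi)  # recursive least-squares update
    return action
\end{verbatim}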


Perturbing the environment by applying unexpected external forces to the robot does not prevent rLfD from inserting successfully.

All demonstrations are provided via teleoperation using a Vive setup. Demonstrations are recorded at 250Hz.

DMPs (Implementation Details)

Running pure RL on the physical robot

Our comparison against RL in simulation confirms that pure RL can generalise to initial start configurations similarly to rLfD, but not to the harder task with a tighter, never-seen-before hole, where the pure RL agent was 13.2% less accurate than rLfD. Moreover, pure RL required 18 times longer training, an engineered reward, and an increase in the magnitude of the actions taken in order to learn. The increased action magnitude meant less safe exploration, which we found to be detrimental in the physical environment, leading to a few broken plugs and to the robot reaching joint limit thresholds on several occasions. The large training requirements meant that the pure RL agent needs at least 2 days of non-stop training to obtain a successful policy even in simulation; this period grows further once the action magnitude is reduced to a safer range. A reduced action size (although still twice as large as the residual actions in rLfD) prevented the robot from learning anything useful, leading to 0% success, so we did not include pure RL in the physical experiment in Figure 5.


SAC
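The exact SAC hyperparameters are not reproduced here; as a rough illustration of how a tf-agents SacAgent can be instantiated for this kind of task, the sketch below uses placeholder observation/action specs, network sizes and learning rates, all of which are assumptions rather than our reported settings.

\begin{verbatim}
import tensorflow as tf
from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent, tanh_normal_projection_network
from tf_agents.networks import actor_distribution_network
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# Placeholder specs: a 6-D state and 3-D bounded action, assumed for illustration.
obs_spec = tensor_spec.TensorSpec([6], tf.float32, name="observation")
act_spec = tensor_spec.BoundedTensorSpec([3], tf.float32,
                                         minimum=-1.0, maximum=1.0, name="action")
ts_spec = ts.time_step_spec(obs_spec)

# Network sizes and learning rates are illustrative placeholders.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    obs_spec, act_spec, fc_layer_params=(256, 256),
    continuous_projection_net=tanh_normal_projection_network.TanhNormalProjectionNetwork)
critic_net = critic_network.CriticNetwork(
    (obs_spec, act_spec), joint_fc_layer_params=(256, 256))

agent = sac_agent.SacAgent(
    ts_spec, act_spec,
    critic_network=critic_net,
    actor_network=actor_net,
    actor_optimizer=tf.keras.optimizers.Adam(3e-4),
    critic_optimizer=tf.keras.optimizers.Adam(3e-4),
    alpha_optimizer=tf.keras.optimizers.Adam(3e-4),
    target_update_tau=0.005,
    gamma=0.99)
agent.initialize()
\end{verbatim}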

PPO
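Similarly, a minimal construction sketch for the tf-agents PPO agent is given below; the specs, network sizes, optimizer and epoch count are placeholder assumptions, not our reported settings.

\begin{verbatim}
import tensorflow as tf
from tf_agents.agents.ppo import ppo_agent
from tf_agents.networks import actor_distribution_network, value_network
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# Placeholder specs, as in the SAC sketch above.
obs_spec = tensor_spec.TensorSpec([6], tf.float32, name="observation")
act_spec = tensor_spec.BoundedTensorSpec([3], tf.float32,
                                         minimum=-1.0, maximum=1.0, name="action")
ts_spec = ts.time_step_spec(obs_spec)

actor_net = actor_distribution_network.ActorDistributionNetwork(
    obs_spec, act_spec, fc_layer_params=(256, 256))
value_net = value_network.ValueNetwork(obs_spec, fc_layer_params=(256, 256))

agent = ppo_agent.PPOAgent(
    ts_spec, act_spec,
    optimizer=tf.keras.optimizers.Adam(3e-4),
    actor_net=actor_net,
    value_net=value_net,
    num_epochs=10,
    discount_factor=0.99)
agent.initialize()
\end{verbatim}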

Physical Robot

Performance along each axis on the peg insertion task for each baseline.