Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience
Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff, Dieter Fox
We consider the problem of transferring policies to the real world by training on a distribution of simulated scenarios. Rather than manually tuning the randomization of simulations, we adapt the simulation parameter distribution using a few real world roll-outs interleaved with policy training. In doing so, we are able to change the distribution of simulations to improve the policy transfer by matching the policy behavior in simulation and the real world. We show that policies trained with our method are able to reliably transfer to different robots in two real world tasks: swing-peg-in-hole and opening a cabinet drawer.
Are the learnt parameters, like friction and damping, physically accurate?
In general, since we don't know the true friction, damping, compliance, etc. in the real world, it is possible that our algorithm learns to transfer from simulation to the real world without actually converging on physically accurate parameters, i.e. our algorithm can learn to cheat. There is no way to verify that the learnt friction is actually the friction in the real world. For instance, the algorithm may learn a gripper friction that is useful for opening the drawer but does not work if you choose a different task.
However, we believe such situations arise in various real world examples: e.g. depth recovered from a monocular camera is never metric in scale, whereas a stereo pair with a known baseline gives depth in metric scale. Such gauge freedom can only be resolved via some form of external calibration, or by making sure multiple tasks are tied together and learnt jointly. Therefore, we would advise adding external priors where possible to make sure you can use the learnt parameters elsewhere.
What if the covariance matrix collapses?
This is a possible scenario if we continue to optimise with real world trajectories over time. Since each optimisation iteration decreases the covariance, it can quickly collapse after a few iterations. The way around this is to enforce a minimum covariance so the optimised covariance never reaches zero. This is akin to adding lambda when inverting the Hessian in Levenberg-Marquardt optimisation.
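A minimal sketch of such a covariance floor (not the paper's actual code; `min_var` is a hypothetical lower bound you would tune to the scale of your simulation parameters):

```python
import numpy as np

def floor_covariance(cov, min_var=1e-4):
    """Clip the eigenvalues of a covariance matrix from below so the
    simulation-parameter distribution never collapses to a point."""
    # Eigendecompose the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Enforce the minimum variance along every principal axis.
    eigvals = np.maximum(eigvals, min_var)
    # Reassemble the floored covariance.
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

# A covariance that has nearly collapsed along its first axis:
cov = np.array([[1e-6, 0.0],
                [0.0,  0.5]])
floored = floor_covariance(cov)
```

Flooring eigenvalues (rather than only the diagonal) keeps the matrix valid even when the collapse happens along a rotated direction.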
Why did we choose to go with REPS instead of CMA-ES?
REPS can be viewed as the weighted update used in CMA-ES with an explicit KL-divergence constraint added to the cost function. This ensures that the learnt parameters don't deviate too much from their previous values, which avoids singularities and helps convergence to stable and sensible values. In principle, you should be able to use either, but REPS is more conservative and respects the KL-divergence constraint for stability.
Is a human still needed to tune initial values?
It is true that a human is still needed to initialise the covariances of the parameters. However, we have observed that initialising the covariance to very large values, in an attempt to cover a wider operating range, hampers the learning process --- extreme randomisation isn't very helpful. Our covariance initialisation therefore tends to be more conservative and is then optimised over time.
In classic domain randomisation, a user tunes the parameters and optimises for zero-shot transfer. After experimenting in the real world, they may realise that some parameters need to be tuned again, so they go back, tune them manually, and repeat the process until the algorithm starts to work in the real world --- this is particularly true for randomising physics parameters. In this work, we have automated this process of a human going back and forth to tune the parameters. However, the initialisation still has to be provided by a human. We believe this is not a huge limitation (our initialisations just have to be conservative, that's it), as such a process is generally needed for various algorithms: choosing a learning rate, choosing the weight initialisation of a network, or indeed choosing the observation uncertainty in a Kalman Filter are all done by hand.
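To illustrate this automated loop end to end, here is a self-contained toy sketch (emphatically not the paper's implementation): the "real world" is a 1-D system with an unknown friction value, the cost is the discrepancy between simulated and real rollouts, and the Gaussian over the friction parameter is updated with a soft elite weighting plus a variance floor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth that the loop does not get to see directly.
TRUE_FRICTION = 0.7

def rollout(friction, steps=20):
    """Toy 1-D rollout: velocity decays with friction at each step."""
    v, traj = 1.0, []
    for _ in range(steps):
        v *= (1.0 - 0.1 * friction)
        traj.append(v)
    return np.array(traj)

# One batch of "real world" experience.
real_traj = rollout(TRUE_FRICTION)

# Gaussian over the friction parameter, initialised conservatively
# (narrow, deliberately off-centre) as discussed above.
mean, var = 0.3, 0.05
for iteration in range(10):
    samples = rng.normal(mean, np.sqrt(var), size=64)
    # Cost = discrepancy between simulated and real trajectories.
    costs = np.array([np.sum((rollout(s) - real_traj) ** 2) for s in samples])
    # Soft elite weighting (a CMA-ES/REPS-flavoured stand-in).
    w = np.exp(-(costs - costs.min()) / (costs.std() + 1e-8))
    w /= w.sum()
    mean = w @ samples
    var = max(w @ (samples - mean) ** 2, 1e-6)  # variance floor
```

After a handful of iterations the mean drifts toward the true friction while the variance shrinks --- the same qualitative behaviour as the covariance plot referenced below.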
In short, although our method still requires manual parameter selection, finding a conservative initial distribution is easier than designing a wide one. The figure below shows how the covariance matrix changes as more real-world rollouts are collected over time.