The goal of this work is to explain the recent success of domain randomization and data augmentation through the lens of causal inference. We frame domain randomization and data augmentation as interventions on the environment that encourage invariance to task-irrelevant features, such as visual perturbations that affect neither reward nor dynamics. Such interventions encourage the learning algorithm to be robust to these variations and to attend to the features that are relevant for solving the task at hand.
This connection leads to two key findings: (1) perturbations to the environment do not have to be realistic, but merely need to exhibit variation along dimensions that also vary in the real world, and (2) an explicit invariance-inducing objective improves generalization in sim2sim and sim2real transfer settings over data augmentation or domain randomization alone.
We demonstrate the capability of our method by performing zero-shot transfer of control policies learned from pixel observations, solving reach and cube-lift tasks on a real 7-DoF Jaco arm.
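As an illustration of what an explicit invariance-inducing objective can look like, the sketch below pairs two renderings of the same underlying state that differ only in visual randomization and penalizes differences between their latent representations. It assumes a PyTorch encoder; `invariance_loss` and the weight `lambda_inv` are illustrative placeholders, not the exact objective used here.

```python
import torch.nn.functional as F

def invariance_loss(encoder, obs_a, obs_b):
    """Penalize representation differences between two views of the same
    underlying state that differ only in task-irrelevant visual factors."""
    z_a = encoder(obs_a)  # latent for one randomized rendering
    z_b = encoder(obs_b)  # latent for another rendering of the same state
    return F.mse_loss(z_a, z_b)

# Added to the usual RL losses with an illustrative weight lambda_inv, e.g.:
# loss = critic_loss + actor_loss + lambda_inv * invariance_loss(encoder, o1, o2)
```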
The environments with randomized visual rendering. On the left are the training environments; on the right is one example of an unseen test environment. We used 3 test environments to evaluate the final performance of each method.
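For concreteness, a rendering intervention can be as simple as resampling task-irrelevant visual parameters at the start of each episode. The sketch below is hypothetical: `sample_rendering_params` and the `env.randomize_rendering` hook are illustrative names, not an existing simulator API.

```python
import random

def sample_rendering_params(rng=random):
    """Sample visual factors that affect pixels but not reward or dynamics."""
    return {
        "table_rgb": [rng.random() for _ in range(3)],             # table colour
        "floor_rgb": [rng.random() for _ in range(3)],             # floor colour
        "light_pos": [rng.uniform(-1.0, 1.0) for _ in range(3)],   # light position
    }

params = sample_rendering_params()
# env.randomize_rendering(params)  # illustrative hook, not an existing API
```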
The plot above shows the evaluation results on unseen environments at test time (top row) and the performance on environments seen during training (bottom row). These plots contrast the policy's overfitting to the environments seen during training with its true generalization performance on new, unseen environments at test time.
Comparison of methods with and without PRI (post-rendering interventions), and with and without RI (rendering interventions), on unseen (top) and seen (bottom) environments.
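To make the PRI/RI distinction concrete: an RI changes the simulator's rendering parameters before the image is produced (as sketched above), whereas a PRI perturbs the already-rendered pixels. The sketch below shows one common post-rendering intervention, a DrQ-style random shift; it assumes PyTorch image batches of shape (B, C, H, W).

```python
import torch
import torch.nn.functional as F

def random_shift(imgs, pad=4):
    """Post-rendering intervention: replicate-pad, then randomly crop back
    to the original size, shifting each image independently."""
    b, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(imgs)
    for i in range(b):
        x = torch.randint(0, 2 * pad + 1, (1,)).item()
        y = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, y:y + h, x:x + w]
    return out
```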
t-SNE comparison of the latent representations learned by IBIT and DrQ. Colours represent the value predicted by the critic. Our method successfully learns a representation that is invariant to domain randomization, while DrQ does not.
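For reference, a comparison like this can be produced by embedding encoder latents with t-SNE and colouring each point by the critic's predicted value. The sketch below assumes scikit-learn and matplotlib; `latents` and `values` are illustrative inputs collected from a trained agent.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_latents(latents, values, title):
    """latents: (N, D) array of encoder outputs; values: (N,) critic values."""
    emb = TSNE(n_components=2).fit_transform(np.asarray(latents))
    plt.scatter(emb[:, 0], emb[:, 1], c=values, cmap="viridis", s=5)
    plt.colorbar(label="predicted value")
    plt.title(title)
    plt.show()

# plot_latents(ibit_latents, ibit_values, "IBIT")  # illustrative inputs
# plot_latents(drq_latents, drq_values, "DrQ")
```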