Behavior Cloned Reset Policy Details
In most cases, our behavior-cloned reset policy is capable of resetting the environment, or at least of making contact with the object, but there are a few states in which the policy is unable to pick up or perturb the object at all. To avoid getting stuck attempting unsuccessful resets in these states, we train two different reset policies: one on reset demonstrations for multiple objects, and the other on demonstrations for only the current experiment's object. For example, when running an experiment with the football, one policy is trained on reset demonstrations for the 3-pronged object, the T-shaped pipe, and the football, while the other is trained only on demonstrations for the football. At the start of each training episode, we select the multi-object reset policy with 80% probability and the single-object reset policy with 20% probability. Because the two policies are trained on different data and therefore behave differently, a state in which one policy gets stuck is unlikely to cause the same issue for the other, which allows training to continue even when one of the two policies is suboptimal.
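The episode-level selection rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the policy objects and the `select_reset_policy` helper are hypothetical, and only the 80%/20% sampling split comes from the text.

```python
import random

# Probability of choosing the multi-object reset policy at episode start
# (the remaining 20% goes to the single-object policy), per the text.
MULTI_OBJECT_PROB = 0.8

def select_reset_policy(multi_policy, single_policy, rng=random):
    # Sample once per training episode: multi-object policy with 80%
    # probability, single-object policy otherwise.
    return multi_policy if rng.random() < MULTI_OBJECT_PROB else single_policy

# Example: empirical selection frequencies over many simulated episodes.
rng = random.Random(0)
counts = {"multi": 0, "single": 0}
for _ in range(10_000):
    counts[select_reset_policy("multi", "single", rng)] += 1
print(counts)
```

Sampling independently each episode (rather than alternating policies on a fixed schedule) means a state that traps one policy is usually revisited by the other within a few episodes.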