Using entropy-regularized imitation learning to bridge behavioural cloning and inverse reinforcement learning.
Summary
Perform KL-regularized behavioural cloning on the demonstration dataset.
Recall the definition of a shaped reward function and critic, which share the same optimal policy as the true reward and critic.
Policy invariance under reward transformations: Theory and application to reward shaping, Ng et al. (1999)
Invert the soft (or posterior) policy iteration update to obtain a shaped reward and critic, which can then be used to improve the initial policy with additional data.
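The inversion step can be illustrated in a small tabular setting. The sketch below is not the authors' implementation; it assumes the standard KL-regularized update pi(a|s) ∝ p(a|s) exp(Q(s,a) / alpha) with a prior policy p and temperature alpha, and all names (`soft_policy_update`, `invert_update`, `alpha`, the random `q` table) are hypothetical. Under that assumption, alpha * log(pi / p) equals Q(s,a) - V(s), which is a reward shaped by the soft value V.

```python
import numpy as np

def soft_policy_update(q_values, prior, alpha):
    """Assumed soft update: pi(a|s) proportional to prior(a|s) * exp(Q(s,a) / alpha)."""
    logits = np.log(prior) + q_values / alpha
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

def soft_value(q_values, prior, alpha):
    """Soft value V(s) = alpha * log sum_a prior(a|s) * exp(Q(s,a) / alpha)."""
    z = np.log(prior) + q_values / alpha
    m = z.max(axis=1, keepdims=True)
    return alpha * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))

def invert_update(policy, prior, alpha):
    """Inverted update: alpha * log(pi / prior), which equals Q(s,a) - V(s)."""
    return alpha * (np.log(policy) - np.log(prior))

# Hypothetical toy problem: 4 states, 3 actions, uniform prior policy.
rng = np.random.default_rng(0)
n_states, n_actions, alpha = 4, 3, 0.5
q = rng.normal(size=(n_states, n_actions))
prior = np.full((n_states, n_actions), 1.0 / n_actions)

pi = soft_policy_update(q, prior, alpha)   # stands in for the cloned policy
shaped = invert_update(pi, prior, alpha)   # shaped reward recovered from the policy
v = soft_value(q, prior, alpha)

# The recovered quantity differs from Q only by a per-state offset V(s),
# so it ranks actions identically in every state.
recovered_ok = np.allclose(shaped, q - v[:, None])
same_greedy_actions = np.array_equal(shaped.argmax(axis=1), q.argmax(axis=1))
```

In this toy setting the policy produced by the update plays the role of the behavioural cloning policy: only the policy and the prior are needed to recover the shaped reward, and the true Q table is used here purely to check the identity.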
Stationary policies in the tabular and continuous settings
Stationary policies are straightforward to implement in the tabular setting (left). The equivalent policy in the continuous setting (center) is harder to implement. We adopt a special neural network architecture that approximates the desired stationary behaviour (right).
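The text above does not spell out the architecture, so the following is only a sketch of one well-known way to obtain stationary behaviour from finite features, namely random Fourier features with paired cosine and sine activations. It may differ from the network actually used here. The function name `stationary_features`, the Gaussian frequency distribution, and the feature count are all assumptions for illustration. The point it shows is that the feature inner product depends only on the difference between two inputs, not on where they sit in the input space.

```python
import numpy as np

def stationary_features(x, freqs):
    """Paired cos/sin features whose inner product depends only on x - y."""
    proj = x @ freqs.T
    feats = np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)
    return feats / np.sqrt(freqs.shape[0])

# Hypothetical sizes; Gaussian frequencies correspond to an RBF-like kernel.
rng = np.random.default_rng(0)
dim, n_features = 2, 256
freqs = rng.normal(size=(n_features, dim))

x = rng.normal(size=(5, dim))
y = rng.normal(size=(5, dim))
shift = np.array([10.0, -3.0])  # translate both inputs by the same offset

# Kernel values before and after the shift.
k_xy = np.sum(stationary_features(x, freqs) * stationary_features(y, freqs), axis=-1)
k_shifted = np.sum(
    stationary_features(x + shift, freqs) * stationary_features(y + shift, freqs),
    axis=-1,
)

# Self-similarity is constant everywhere, a hallmark of a stationary kernel.
k_xx = np.sum(stationary_features(x, freqs) ** 2, axis=-1)
```

Because cos(a)cos(b) + sin(a)sin(b) = cos(a - b), the shifted and unshifted kernel values agree up to floating-point error, and the self-similarity is exactly one for every input.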
Continuous control from agent demonstrations
Continuous control from human demonstrations from states
Continuous control from human demonstrations from images