Latent Action Priors from a Single Gait Cycle Demonstration
Oliver Hausdörfer, Alexander von Rohr, Eric Lefort, Angela P. Schoellig
Paper [arXiv] Code [GitHub]
We propose learning a latent action representation from expert demonstrations and subsequently using it as a prior in deep reinforcement learning (DRL) for locomotion tasks. The latent action prior can be learned from a single gait cycle of expert demonstration, consisting of only a few data points (5-106 frames). The expert data can be generated by an open-loop controller. Learning from such low-diversity data is typically difficult for imitation learning. We show that combining our latent action priors with style rewards is particularly effective for imitating the expert.
Fig. 1: Method. We learn a latent action representation with a simple autoencoder from a single gait cycle of expert demonstration. These latent actions are used as a prior in deep reinforcement learning (DRL) via the decoder: during DRL training, only the policy is optimized, while the latent action decoder remains frozen. We combine our approach with style rewards for imitation.
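For concreteness, here is a minimal sketch of the idea in PyTorch. All names, network sizes, the file path, and the style reward formulation are illustrative assumptions, not the authors' implementation: an autoencoder is fit to the expert actions of one gait cycle, the frozen decoder then maps the policy's latent actions to joint-space actions, and a simple phase-indexed style reward encourages imitation.

```python
import torch
import torch.nn as nn

class ActionAutoencoder(nn.Module):
    """Small autoencoder over expert actions (layer sizes illustrative)."""
    def __init__(self, action_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(action_dim, 32), nn.Tanh(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.Tanh(),
                                     nn.Linear(32, action_dim))

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(actions))

# One gait cycle of expert actions, shape (T, action_dim) with small T
# (5-106 frames), e.g. recorded from an open-loop controller.
expert_actions = torch.load("gait_cycle_actions.pt")  # hypothetical path

ae = ActionAutoencoder(action_dim=expert_actions.shape[1], latent_dim=4)
optim = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(5000):  # few datapoints, so many cheap gradient steps
    optim.zero_grad()
    loss = nn.functional.mse_loss(ae(expert_actions), expert_actions)
    loss.backward()
    optim.step()

# During DRL, only the policy is trained; the decoder stays frozen and
# turns the policy's latent action into a joint-space action.
for p in ae.decoder.parameters():
    p.requires_grad_(False)

def decode_action(latent_action: torch.Tensor) -> torch.Tensor:
    return ae.decoder(latent_action)

def style_reward(joint_pos: torch.Tensor, phase_idx: int,
                 expert_joint_pos: torch.Tensor, scale: float = 5.0):
    # Reward closeness to the phase-matched expert pose (one common
    # form of style reward; the paper's formulation may differ).
    err = torch.sum((joint_pos - expert_joint_pos[phase_idx]) ** 2)
    return torch.exp(-scale * err)
```

Intuitively, the frozen decoder constrains exploration to action patterns consistent with the demonstrated gait, which is what makes the prior useful even when the demonstration is a single gait cycle.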
One gait cycle demonstration
Baseline PPO
PPO+latent action prior
PPO+latent action prior+style
Transfer Tasks
For the following transfer tasks, we use the same gait cycle of expert demonstration as above. Interestingly, at 4x target speed we observe a transition to a galloping gait.
2x target speed
3x target speed
4x target speed
Any target direction
Other Environments
We use the following gait cycles of expert demonstrations for Half-Cheetah, Ant, Humanoid, and Unitree H1.
Results after deep reinforcement learning (PPO) with latent action priors and style rewards. For the Humanoid, we use only the latent action priors.
Two Unitree A1 Task
Two Unitree A1s must jointly solve the task and transport the rod to the target location. The task is solved once the rod is within 0.1 m of the target, and the target is randomly sampled every episode. Only PPO+latent action prior+style solves the task, which shows that the prior information enables solving new tasks. We use the same single gait cycle demonstration for the Unitree A1 as above.
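As a concrete reading of the success criterion, a sketch of the check is below (function and variable names are hypothetical; only the 0.1 m threshold and per-episode target resampling come from the task description above):

```python
import numpy as np

def rod_task_solved(rod_position: np.ndarray, target: np.ndarray,
                    threshold: float = 0.1) -> bool:
    # Solved once the rod is within 0.1 m of the randomly sampled target.
    return bool(np.linalg.norm(rod_position - target) <= threshold)
```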
Baseline PPO
PPO+latent action prior+style
Please refer to the paper for full results. Cite the project as: