Latent Action Priors from a Single Gait Cycle Demonstration
Oliver Hausdörfer, Alexander von Rohr, Eric Lefort, Angela P. Schoellig
Paper [arXiv] Code [GitHub]
We propose learning a latent action representation from expert demonstrations and subsequently using it as a prior in deep reinforcement learning (DRL) for locomotion tasks. The latent action prior can be learned from a single gait cycle of expert demonstration, consisting of only a few data points (5-106 frames). The expert data can be generated by an open-loop controller. Learning from such low-diversity data is typically difficult for imitation learning. We show that combining our latent action priors with style rewards is particularly effective for imitating the expert.
Fig. 1: Method. We learn a latent action representation with a simple autoencoder from a single gait cycle of expert demonstration. These latent actions are used as a prior in deep reinforcement learning (DRL) via the decoder: during DRL training, only the policy is optimized, while the latent action decoder remains frozen. We combine our approach with style rewards for imitation.
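For concreteness, here is a minimal sketch of the idea in PyTorch. All names, network sizes, the file path, and the style reward formulation are illustrative assumptions, not the authors' implementation: an autoencoder is fit to the expert actions of one gait cycle, the frozen decoder then maps the policy's latent actions to joint-space actions, and a simple phase-indexed style reward encourages imitation.

```python
import torch
import torch.nn as nn

class ActionAutoencoder(nn.Module):
    """Small autoencoder over expert actions (layer sizes illustrative)."""
    def __init__(self, action_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(action_dim, 32), nn.Tanh(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.Tanh(),
                                     nn.Linear(32, action_dim))

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(actions))

# One gait cycle of expert actions, shape (T, action_dim) with small T
# (5-106 frames), e.g. recorded from an open-loop controller.
expert_actions = torch.load("gait_cycle_actions.pt")  # hypothetical path

ae = ActionAutoencoder(action_dim=expert_actions.shape[1], latent_dim=4)
optim = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(5000):  # few datapoints, so many cheap gradient steps
    optim.zero_grad()
    loss = nn.functional.mse_loss(ae(expert_actions), expert_actions)
    loss.backward()
    optim.step()

# During DRL, only the policy is trained; the decoder stays frozen and
# turns the policy's latent action into a joint-space action.
for p in ae.decoder.parameters():
    p.requires_grad_(False)

def decode_action(latent_action: torch.Tensor) -> torch.Tensor:
    return ae.decoder(latent_action)

def style_reward(joint_pos: torch.Tensor, phase_idx: int,
                 expert_joint_pos: torch.Tensor, scale: float = 5.0):
    # Reward closeness to the phase-matched expert pose (one common
    # form of style reward; the paper's formulation may differ).
    err = torch.sum((joint_pos - expert_joint_pos[phase_idx]) ** 2)
    return torch.exp(-scale * err)
```

Intuitively, the frozen decoder constrains exploration to action patterns consistent with the demonstrated gait, which is what makes the prior useful even when the demonstration is a single gait cycle.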
One gait cycle demonstration
Baseline PPO
PPO+latent action prior
PPO+latent action prior+style
Transfer Tasks
For the following transfer tasks, we use the same gait cycle of expert demonstration as above. Interestingly, at 4x target speed we observe a transition to a galloping gait.
2x target speed
3x target speed
4x target speed
Any target direction
Other Environments
We use the following gait cycles of expert demonstrations for Half-Cheetah, Ant, Humanoid, and Unitree H1.
Results after deep reinforcement learning (PPO) with latent action priors and style rewards. For the Humanoid, we use only the latent action priors.
Two Unitree A1 Task
Two Unitree A1s must jointly solve the task and transport the rod to the target location. The task is solved once the rod is within 0.1 m of the target, and the target is randomly sampled every episode. Only PPO+latent action prior+style solves the task, which shows that the prior information enables solving new tasks. We use the same single gait cycle demonstration for the Unitree A1 as above.
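As a concrete reading of the success criterion, a sketch of the check is below (function and variable names are hypothetical; only the 0.1 m threshold and per-episode target resampling come from the task description above):

```python
import numpy as np

def rod_task_solved(rod_position: np.ndarray, target: np.ndarray,
                    threshold: float = 0.1) -> bool:
    # Solved once the rod is within 0.1 m of the randomly sampled target.
    return bool(np.linalg.norm(rod_position - target) <= threshold)
```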
Baseline PPO
PPO+latent action prior+style
Please refer to the paper for full results. Cite the project as: