Data-Driven Action Prior and Style Reward for Quadruped Locomotion
Anonymous Author(s)
Code [TBA] Paper [TBA]
On the website, we show the full video and LLM prompts.
LLM-designed open-loop controller as demonstration
Deployed as steerable locomotion policy
(2x speed)
Generalization to rough terrains
(2x speed)
Policy in direct torque control mode with one additional reward term
LLM prompt: trotting
LLM prompt: walking
LLM prompt: stairs
Tested in benchmark environments
We benchmark our approach in a variety of simulated environments based on demonstrations from existing datasets.
First, we obtain the demonstrations (an example open-loop controller is sketched below the panels).
Demonstration by open-loop controller (A. Raffin, 2021)
Demonstration by open-loop controller (A. Raffin, 2021)
Demonstration by SAC policy
Demonstration from expert dataset (F. Al-Hafez, 2023)
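To make the first step concrete, here is a minimal sketch of an open-loop demonstration controller in the spirit of the sine-based oscillators of A. Raffin (2021). All frequencies, amplitudes, and phase offsets below are illustrative placeholders, not the LLM-designed values used in the paper.

```python
import numpy as np

# Illustrative gait parameters: one sine oscillator per leg.
# These values are placeholders, not the paper's LLM-designed ones.
FREQ_HZ = 1.5                                 # stepping frequency
AMPLITUDE = 0.4                               # joint target amplitude (rad)
PHASES = np.array([0.0, np.pi, np.pi, 0.0])   # diagonal leg pairs in phase (trot)

def open_loop_action(t: float) -> np.ndarray:
    """Joint position targets at time t for a trot-like gait."""
    return AMPLITUDE * np.sin(2.0 * np.pi * FREQ_HZ * t + PHASES)

# Rolling out the controller at 50 Hz yields the demonstration trajectory.
dt, horizon = 0.02, 500
demo_actions = np.stack([open_loop_action(i * dt) for i in range(horizon)])
```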
Second, we train a robust policy with a latent action prior and a single style reward term using PPO.
Learned policy
Learned policy
Learned policy
Learned policy
Latent action priors improve policy learning, as measured by task reward, across all environments. Adding a single style reward term further improves the visual appearance of the gait in simulation for demonstrations generated with open-loop controllers.
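As an illustration of the second step, one common way to obtain a latent action prior is to fit a small autoencoder on the demonstration actions and let the PPO policy act in the latent space through the frozen decoder. The architecture and dimensions below are assumptions made for this sketch, not the paper's exact design.

```python
import torch
import torch.nn as nn

ACTION_DIM, LATENT_DIM = 12, 4  # illustrative: 12 joints, 4-dim latent

# Small autoencoder over single-step demonstration actions.
encoder = nn.Sequential(nn.Linear(ACTION_DIM, 64), nn.ELU(), nn.Linear(64, LATENT_DIM))
decoder = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ELU(), nn.Linear(64, ACTION_DIM))

def train_prior(demo_actions: torch.Tensor, epochs: int = 500) -> None:
    """Fit the autoencoder to reconstruct demonstration actions (N, ACTION_DIM)."""
    opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
    for _ in range(epochs):
        recon = decoder(encoder(demo_actions))
        loss = nn.functional.mse_loss(recon, demo_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()

def decode_policy_action(z: torch.Tensor) -> torch.Tensor:
    """During PPO training, the policy outputs a latent z; the frozen
    decoder maps it to joint targets near the demonstration manifold."""
    with torch.no_grad():
        return decoder(z)
```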
Action prior + style reward leads to robust imitation
The combination of a latent action prior with a single style reward term leads to robust imitation of the demonstration.
Demonstration data (F. Al-Hafez, 2023)
PPO w/o reward tuning
PPO + latent action prior
PPO + latent action prior + style
Please refer to the paper for full results.