Learning What To Do by Simulating the Past

David Lindner (ETH Zurich), Rohin Shah (UC Berkeley), Pieter Abbeel (UC Berkeley), Anca Dragan (UC Berkeley)


Since reward functions are hard to specify, recent work has focused on learning policies from human feedback. However, such approaches are impeded by the expense of acquiring that feedback. Prior work proposed that agents have access to a source of information that is effectively free: in any environment that humans have acted in, the state will already be optimized for human preferences, and thus an agent can extract information about what humans want from the state (Shah et al., 2019). Such learning is possible in principle, but requires simulating all possible past trajectories that could have led to the observed state. This is feasible in grid worlds, but how do we scale it to complex tasks? In this work, we show that by combining a learned feature encoder with learned inverse models, we can enable agents to simulate human actions backwards in time to infer what they must have done. The resulting algorithm is able to reproduce a specific skill in MuJoCo environments given a single state sampled from the optimal policy for that skill.
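The core idea of simulating the past can be sketched as a backward rollout: starting from the observed state, repeatedly guess which action was taken to reach it and which state preceded it. The sketch below is a minimal toy illustration, not the paper's implementation; `inverse_policy` and `inverse_dynamics` are hypothetical stand-ins for the learned inverse models (in Deep RLSP these would be neural networks operating in a learned feature space), and the state is a single float for simplicity.

```python
import random

def inverse_policy(state):
    """Hypothetical stand-in: guess an action that likely led to `state`."""
    return random.choice([-1.0, 1.0])

def inverse_dynamics(state, action):
    """Hypothetical stand-in: predict the predecessor state."""
    return state - 0.1 * action

def simulate_past(observed_state, horizon):
    """Roll a trajectory backwards in time from the observed state."""
    trajectory = [observed_state]
    state = observed_state
    for _ in range(horizon):
        action = inverse_policy(state)        # which action was taken?
        state = inverse_dynamics(state, action)  # which state preceded it?
        trajectory.append(state)
    trajectory.reverse()  # reorder from past to present
    return trajectory

past = simulate_past(observed_state=1.0, horizon=5)
print(len(past))  # observed state plus 5 simulated predecessors
```

In the actual algorithm, many such backward trajectories would be sampled and used to infer what the human must have been optimizing for; this sketch only shows the backward-simulation loop itself.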

Supplementary Material

In our paper, we present experiments in the MuJoCo simulator, in which we sample a small number of states from a policy and then use the Deep RLSP algorithm to learn to imitate this policy.

On this website, we provide clips from the final policies learned by Deep RLSP and compare them to the original policy. Additionally, we show policies learned by our ablations, AverageFeatures and Waypoints, as discussed in the paper. As a baseline, we show policies learned by GAIL.

For the jumping and balancing behaviors, the videos allow a qualitative evaluation of the policies, similar to Figure 1 in the paper. For the locomotion behaviors, the videos illustrate the resulting policies reported in Table 1.



David Lindner, Rohin Shah, Pieter Abbeel, Anca Dragan. Learning What To Do by Simulating the Past. In International Conference on Learning Representations (ICLR), 2021.



@inproceedings{lindner2021learning,
    title={Learning What To Do by Simulating the Past},
    author={Lindner, David and Shah, Rohin and Abbeel, Pieter and Dragan, Anca},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2021}
}