Learning a subspace of policies for online adaptation in
Reinforcement Learning
Jean-Baptiste Gaya Laure Soulier Ludovic Denoyer
This website presents supplementary material for our paper.
Learning a policy works well in the classical setting, but assuming that the environment at train time and the environment at test time are identical is unrealistic in many practical applications. Therefore, after training, one usually needs an additional fine-tuning procedure to reach an optimal policy on the test environment, which may be expensive in terms of interactions.
Instead of learning a single policy at training time, our model learns a Line of Policies (LoP) in the space of policy parameters. This method, inspired by recent research on mode connectivity [1], yields a diversity of behaviors at inference time that makes the agent more robust to variations. The objective maximized by our algorithm is sketched below.
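Using the notation implied above (θ_1 and θ_2 are the two endpoint parameter vectors and R(τ) denotes the return of a trajectory τ), one informal rendering of this objective is the expected return of policies drawn uniformly on the segment between the two endpoints; any auxiliary terms used in the paper are omitted here:

$$\max_{\theta_1, \theta_2}\; \mathbb{E}_{z \sim \mathcal{U}([0,1])}\; \mathbb{E}_{\tau \sim \pi_{\theta(z)}}\big[R(\tau)\big], \qquad \theta(z) = (1-z)\,\theta_1 + z\,\theta_2 .$$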
Like [2] and [3], we consider a simple yet hard-to-tackle setting where an agent is trained on a single environment. At test time, it has a budget of K "free" episodes to adapt to a variation of the training environment. LoP makes this adaptation straightforward: we evaluate K different policies whose parameters are uniformly spread over the convex subspace (the line segment) defined by θ_1 and θ_2.
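As a concrete illustration, here is a minimal Python sketch of this K-shot procedure. It assumes flat NumPy parameter vectors, a user-provided `evaluate_policy` function (one test episode per call), and that the best-performing policy is the one retained; these names are illustrative and do not come from the authors' code.

```python
import numpy as np

def k_shot_adaptation(theta_1, theta_2, evaluate_policy, K=5):
    """Evaluate K policies uniformly spread on the segment between
    theta_1 and theta_2, and keep the best-performing one."""
    best_reward, best_theta = -np.inf, None
    for z in np.linspace(0.0, 1.0, K):
        # Convex combination of the two endpoint parameter vectors.
        theta_z = (1.0 - z) * theta_1 + z * theta_2
        reward = evaluate_policy(theta_z)  # one "free" test episode
        if reward > best_reward:
            best_reward, best_theta = reward, theta_z
    return best_theta, best_reward
```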
We designed a LoP version of Proximal Policy Optimization and tested it on complex continuous control tasks using the Brax engine [4]. While our model is trained on the vanilla environment, it is evaluated on a set of test environments in which both the agent's morphology and the environment physics can vary (as in [5]). We present below a selection of these test environments and the "5-shot" trajectories of the same LoP model trained on HalfCheetah. For each environment, a nearly-optimal policy is found and the agent is able to adapt.
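Since Brax policies store their parameters as JAX pytrees, one natural way to instantiate a point on the line during training or evaluation is to interpolate the two endpoint pytrees leaf by leaf. The sketch below is an assumption about how this can be done; `interpolate_params` and `sample_policy_params` are hypothetical names, not the authors' API.

```python
import jax
import jax.numpy as jnp

def interpolate_params(theta_1, theta_2, z):
    """Return (1 - z) * theta_1 + z * theta_2, computed leaf by leaf."""
    return jax.tree_util.tree_map(lambda a, b: (1.0 - z) * a + z * b,
                                  theta_1, theta_2)

def sample_policy_params(theta_1, theta_2, rng):
    """Sample a point on the line of policies for the next rollout."""
    z = jax.random.uniform(rng)  # mixing coefficient drawn in [0, 1)
    return interpolate_params(theta_1, theta_2, z)
```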
[Videos: 5-shot trajectories on the selected test environments; rewards of the displayed rollouts: 0, 1675, 5736, 2194, 1102.]
More qualitative examples on HalfCheetah variations below: