Learning a subspace of policies for online adaptation in
Reinforcement Learning
Jean-Baptiste Gaya Laure Soulier Ludovic Denoyer
This website presents supplementary material for our paper.
Learning a policy works well in the classical setting, but assuming that the environment at train time and the environment at test time are identical is unrealistic in many practical applications. Therefore, after training, one usually needs an additional fine-tuning procedure to reach an optimal policy on the test environment, which may be expensive in terms of interactions.
Instead of learning a single policy at training time, our model learns a Line of Policies (LoP) in the space of policy parameters. This method, inspired by recent research on mode connectivity [1], yields a diversity of behaviors at inference time that makes the agent more robust to variations. The objective maximized by our algorithm is sketched below.
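Using the notation implied above (θ_1 and θ_2 are the two endpoint parameter vectors and R(τ) denotes the return of a trajectory τ), one informal rendering of this objective is the expected return of policies drawn uniformly on the segment between the two endpoints; any auxiliary terms used in the paper are omitted here:

$$\max_{\theta_1, \theta_2}\; \mathbb{E}_{z \sim \mathcal{U}([0,1])}\; \mathbb{E}_{\tau \sim \pi_{\theta(z)}}\big[R(\tau)\big], \qquad \theta(z) = (1-z)\,\theta_1 + z\,\theta_2 .$$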
Like [2] and [3], we consider a simple yet hard-to-tackle setting where an agent is trained on a single environment. At test time, it has a budget of K "free" episodes to adapt to a variation of the training environment. LoP makes this adaptation straightforward: we evaluate K different policies whose parameters are uniformly spread over the convex subspace (the line segment) defined by θ_1 and θ_2.
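As a concrete illustration, here is a minimal Python sketch of this K-shot procedure. It assumes flat NumPy parameter vectors, a user-provided `evaluate_policy` function (one test episode per call), and that the best-performing policy is the one retained; these names are illustrative and do not come from the authors' code.

```python
import numpy as np

def k_shot_adaptation(theta_1, theta_2, evaluate_policy, K=5):
    """Evaluate K policies uniformly spread on the segment between
    theta_1 and theta_2, and keep the best-performing one."""
    best_reward, best_theta = -np.inf, None
    for z in np.linspace(0.0, 1.0, K):
        # Convex combination of the two endpoint parameter vectors.
        theta_z = (1.0 - z) * theta_1 + z * theta_2
        reward = evaluate_policy(theta_z)  # one "free" test episode
        if reward > best_reward:
            best_reward, best_theta = reward, theta_z
    return best_theta, best_reward
```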
We designed a LoP version of Proximal Policy Optimization and tested it on complex continuous control tasks using the Brax engine [4]. While our model is trained on the vanilla environment, it is evaluated on a set of test environments in which both the agent's morphology and the environment physics can vary (as in [5]). We present below a selection of these test environments and the "5-shot" trajectories of the same LoP model trained on HalfCheetah. For each environment, a nearly-optimal policy is found and the agent is able to adapt.
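Since Brax policies store their parameters as JAX pytrees, one natural way to instantiate a point on the line during training or evaluation is to interpolate the two endpoint pytrees leaf by leaf. The sketch below is an assumption about how this can be done; `interpolate_params` and `sample_policy_params` are hypothetical names, not the authors' API.

```python
import jax
import jax.numpy as jnp

def interpolate_params(theta_1, theta_2, z):
    """Return (1 - z) * theta_1 + z * theta_2, computed leaf by leaf."""
    return jax.tree_util.tree_map(lambda a, b: (1.0 - z) * a + z * b,
                                  theta_1, theta_2)

def sample_policy_params(theta_1, theta_2, rng):
    """Sample a point on the line of policies for the next rollout."""
    z = jax.random.uniform(rng)  # mixing coefficient drawn in [0, 1)
    return interpolate_params(theta_1, theta_2, z)
```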
[Videos: 5-shot trajectories on the selected test environments; rewards of the displayed rollouts: 0, 1675, 5736, 2194, 1102.]
More qualitative examples on HalfCheetah variations below: