Online adaptation is crucial for systems deployed in the real world; it requires the ability to use prior knowledge to quickly adapt to new tasks and environmental perturbations. In this work, we propose a meta-learning approach for learning online adaptation. To achieve low sample complexity during meta-training and enable real-world application, we study the online adaptation problem in the context of model-based reinforcement learning. Our approach efficiently meta-trains a global dynamics model that learns to use recent data in order to perform quick adaptation. We introduce two instantiations of this approach: recurrence-based adaptive control (RBAC) and gradient-based adaptive control (GBAC). Finally, we demonstrate successful online adaptation on several simulated robotic control tasks with complex contact dynamics.
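As a rough illustration of the gradient-based variant, here is a minimal PyTorch sketch of a MAML-style meta-objective for the dynamics model: adapt the meta-learned parameters on a short window of recent transitions, then score the adapted parameters on the transitions that follow. The model architecture, the window handling, and the inner learning rate below are illustrative assumptions, not the exact setup used in our experiments.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state from (state, action); sizes are placeholders."""
    def __init__(self, s_dim=20, a_dim=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, s_dim),
        )

    def forward(self, s, a):
        # Predict the state delta and add it to the current state.
        return s + self.net(torch.cat([s, a], dim=-1))

def meta_loss(model, past, future, inner_lr=0.01):
    """MAML-style objective: adapt the meta-parameters on the past M transitions,
    then score the adapted parameters on the following K transitions."""
    (s_p, a_p, sp_next), (s_f, a_f, sf_next) = past, future
    params = dict(model.named_parameters())
    # Inner step: one gradient update on recent data (graph kept for the outer step).
    inner = nn.functional.mse_loss(
        torch.func.functional_call(model, params, (s_p, a_p)), sp_next)
    grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
    adapted = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}
    # Outer loss: prediction error of the adapted model on the future window.
    return nn.functional.mse_loss(
        torch.func.functional_call(model, adapted, (s_f, a_f)), sf_next)
```

During meta-training, this outer loss would be averaged over many sampled (past, future) windows and minimized over the meta-parameters with a standard optimizer.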
In the following plots, we show learning progress for 3 agents during training.
1. In all cases, our methods (GBAC and RBAC) achieve high returns in the low data regime.
2. Although MB sometimes achieves training performance comparable to that of GBAC/RBAC, its test-time performance is still lower than that of the other methods, because MB cannot adapt to out-of-distribution tasks. Additionally, the performance of GBAC and RBAC was boosted by using more fine-grained action optimization at test time, whereas MB saw no such improvement because its model is not as accurate (a sketch of this test-time planning loop follows the plots below).
(Plots: Ant crippled leg; 7-DoF arm force perturbations; Half-Cheetah immobilized joint.)
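The test-time action optimization mentioned above is a planning loop over the learned (and adapted) dynamics model. Below is a generic random-shooting MPC sketch; the planner hyperparameters, the batched `model(states, actions)` interface, and `reward_fn` are illustrative assumptions rather than the exact planner we used.

```python
import numpy as np

def plan_action(model, reward_fn, state, a_dim, horizon=10, n_candidates=1000, rng=None):
    """Random-shooting MPC: sample candidate action sequences, roll them out
    through the (adapted) dynamics model, and return the first action of the
    best-scoring sequence."""
    rng = np.random.default_rng() if rng is None else rng
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, a_dim))
    states = np.repeat(state[None], n_candidates, axis=0)
    returns = np.zeros(n_candidates)
    for t in range(horizon):
        next_states = model(states, actions[:, t])          # batched one-step prediction
        returns += reward_fn(states, actions[:, t], next_states)
        states = next_states
    return actions[np.argmax(returns), 0]                   # execute only the first action

# Toy usage with a dummy model (replace with the adapted dynamics model):
dummy_model = lambda s, a: s + 0.05 * a
dummy_reward = lambda s, a, s_next: -np.sum(s_next ** 2, axis=-1)   # drive the state to zero
first_action = plan_action(dummy_model, dummy_reward, np.ones(6), a_dim=6)
```

In this framing, "more fine-grained action optimization" can mean, for instance, sampling more candidate sequences, planning over a longer horizon, or replanning more frequently.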
At each timestep, GBAC takes one gradient step to adapt its parameters from θ*. We measured the effect of these parameter updates and show the results in the histograms below: the model error over the next K steps under the adapted parameters, and the model error over the same K steps under the pre-update (i.e., non-adapted) parameters. The adapted model's error distribution is clearly lower than that of the pre-update model, so we can conclude that adaptation matters (a sketch of this error measurement follows the histograms).
(Histograms: Half-Cheetah pier, runs #1 and #2; Half-Cheetah sloped terrain, runs #1 and #2.)
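The quantity shown in these histograms can be computed roughly as below: roll the model forward for K steps under a given parameter set and average the per-step prediction error against the real trajectory. This is a minimal PyTorch sketch; the open-loop rollout and the names `theta_star` and `theta_adapted` (the pre-update and one-gradient-step-adapted parameter dictionaries) are assumptions for illustration.

```python
import torch

@torch.no_grad()
def k_step_error(model, params, s0, actions, true_states):
    """Average prediction error over the next K steps for a given parameter set
    (either the pre-update theta* or the adapted parameters)."""
    s, total = s0, 0.0
    for a, s_true in zip(actions, true_states):
        s = torch.func.functional_call(model, params, (s, a))   # predict one step ahead
        total += torch.mean((s - s_true) ** 2).item()
    return total / len(actions)

# One histogram sample per window of real data:
# err_pre     = k_step_error(model, theta_star, s0, next_k_actions, next_k_states)
# err_adapted = k_step_error(model, theta_adapted, s0, next_k_actions, next_k_states)
```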
To see how the training distribution affects test performance, we used GBAC to train models of the 7-DoF arm, where each model saw the same number of datapoints during meta-training, but those datapoints came from different ranges of force perturbations. We observe (in the plot below) that
1. Seeing more during training helps at test time: the model that saw a large range of force perturbations during training performed the best.
2. The model that saw no force perturbations during training performed the worst.
3. The middle three models show comparable performance on the "constant force = 4" case, which is an out-of-distribution task for those models. Thus, there is not a strong restriction on what needs to be seen during training in order for adaptation to occur at test time (though the general trend is that more is better).
1) Half-Cheetah: immobilized joint (i.e., the agent cannot apply torques to that joint; see the action-wrapper sketch after this list).
Test on immobilizing a joint from the training distribution, immobilizing a joint not in the training distribution, and immobilizing a joint partway through a rollout.
2) Half-Cheetah: sloped terrain
Test on a gentle upward slope, a steep upward slope, and steep up/down hills.
3) Half-Cheetah: pier with sections of different damping
Test on a random configuration of pier damping.
4) Ant: crippled leg (i.e., the agent cannot apply torques to that leg, and the leg itself is shrunk).
Test on crippling a leg from the training distribution, crippling a leg not in the training distribution, and crippling a leg partway through a rollout.
5) 7-DoF arm: force perturbations
Test on small perturbations, large perturbations, and constantly changing perturbations.
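For concreteness, a perturbation like the immobilized joint or crippled leg can be implemented as a thin action wrapper that zeroes the torques sent to the affected joints. The sketch below uses the Gym `ActionWrapper` interface; the environment name and joint index in the usage comment are illustrative, and the leg-shrinking part of the Ant variant additionally requires modifying the simulator's body geometry, which is not shown.

```python
import gym
import numpy as np

class ImmobilizedJointWrapper(gym.ActionWrapper):
    """Illustrative 'immobilized joint' / 'crippled leg' perturbation:
    torques for the chosen joints are zeroed before reaching the simulator."""
    def __init__(self, env, joint_indices):
        super().__init__(env)
        self.joint_indices = np.asarray(joint_indices)

    def action(self, action):
        action = np.array(action, copy=True)
        action[self.joint_indices] = 0.0   # the agent can no longer actuate these joints
        return action

# e.g., immobilize one of the Half-Cheetah's six joints (index is illustrative):
# env = ImmobilizedJointWrapper(gym.make("HalfCheetah-v2"), joint_indices=[2])
```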