Deep Online Learning via Meta-Learning:

Continual Adaptation for Model-Based RL

Anusha Nagabandi, Chelsea Finn, Sergey Levine

Humans and animals can learn complex predictive models that allow them to accurately and reliably reason about real-world phenomena, and they can adapt such models extremely quickly in the face of unexpected changes. Deep neural network models allow us to represent very complex functions, but lack this capacity for rapid online adaptation. The goal in this paper is to develop a method for continual online learning from an incoming stream of data, using deep neural network models. We formulate an online learning procedure that uses stochastic gradient descent to update model parameters, and an expectation maximization algorithm with a Chinese restaurant process prior to develop and maintain a mixture of models to handle non-stationary task distributions. This allows all models to be adapted as necessary, with new models instantiated for task changes and old models recalled when previously seen tasks are encountered again. Furthermore, we observe that meta-learning can be used to meta-train a model such that this direct online adaptation with SGD is effective, which is otherwise not the case for large function approximators. In this work, we apply our meta-learning for online learning (MOLe) approach to model-based reinforcement learning, where adapting the predictive model is critical for control; we demonstrate that MOLe outperforms alternative prior methods, and enables effective continuous adaptation in non-stationary task distributions such as varying terrains, motor failures, and unexpected disturbances.
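The online procedure described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the interface (`loglik`, `grad`, the concentration `alpha`, and the learning rate) and the tie-breaking behavior are all assumptions. Each step scores every existing model, plus a candidate new model instantiated from the meta-trained prior, under a CRP-weighted posterior (E-step), then takes responsibility-weighted SGD steps on the models (M-step).

```python
import numpy as np

class MOLeSketch:
    """Minimal sketch of online EM with a Chinese restaurant process
    prior over a mixture of adapted models. Hypothetical interface:
    `prior_theta` is the meta-trained parameter vector theta*,
    `loglik(theta, batch)` scores recent data under a model, and
    `grad(theta, batch)` is the gradient of that log-likelihood."""

    def __init__(self, prior_theta, loglik, grad, alpha=1.0, lr=0.1):
        self.prior_theta = prior_theta
        self.loglik, self.grad = loglik, grad
        self.alpha, self.lr = alpha, lr
        self.thetas = [prior_theta.copy()]  # start with one model
        self.counts = [1.0]                 # CRP "customer" counts

    def step(self, batch):
        # E-step: posterior over existing models plus a candidate new
        # model spawned from the meta-trained prior, with CRP weights.
        logp = [np.log(c) + self.loglik(th, batch)
                for c, th in zip(self.counts, self.thetas)]
        logp.append(np.log(self.alpha) + self.loglik(self.prior_theta, batch))
        logp = np.array(logp)
        p = np.exp(logp - logp.max())
        p /= p.sum()

        if np.argmax(p) == len(p) - 1:      # new task inferred online
            self.thetas.append(self.prior_theta.copy())
            self.counts.append(0.0)
        else:                                # recall an existing task
            p = p[:-1] / p[:-1].sum()

        # M-step: responsibility-weighted SGD (ascent on log-likelihood).
        for i, w in enumerate(p):
            self.counts[i] += w
            self.thetas[i] = self.thetas[i] + self.lr * w * self.grad(self.thetas[i], batch)
        return p
```

With a toy quadratic log-likelihood, feeding the mixture data near the current model keeps a single task, while data far from both the adapted model and the prior spawns a new task parameter, mirroring the "new tasks inferred, old tasks recalled" behavior described above.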

Overview of Comparisons

MOLe (ours): meta-learning for online learning, with our EM algorithm for building/maintaining a mixture of models to handle non-stationary task distributions

k-shot adaptation with meta-learning: meta-learning for adaptation, with each adaptation step taken from the meta-trained prior θ*.

(Standard k-shot learning setup; it has limited ability to adapt to more extreme/out-of-distribution tasks.)

continued adaptation with meta-learning: meta-learning for adaptation, with each adaptation step occurring from the parameters of the previous time step.

(This often overfits to recently observed tasks; it highlights the importance of our method's ability to identify task structure, which avoids overfitting and enables recall.)

model-based RL: non-meta-learned vanilla model-based RL method, with no adaptation.

model-based RL with online gradient updates: same model as the model-based RL method, but adapted online using gradient descent at each timestep.
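The two meta-learning baselines above differ only in which parameters each online update starts from. A minimal sketch of that distinction, where `grad` is a placeholder loss gradient and the step size is illustrative:

```python
import numpy as np

def online_adaptation(theta_star, grad, stream, lr=0.1, kshot=True):
    """Hypothetical comparison of the two meta-learning baselines.

    kshot=True : k-shot adaptation -- every step adapts from the
                 meta-trained prior theta*, so the model stays close
                 to the prior and cannot drift far from it.
    kshot=False: continued adaptation -- each step adapts from the
                 previous step's parameters, so the model can track
                 (and overfit to) recently observed data.
    """
    theta = theta_star.copy()
    for batch in stream:
        start = theta_star if kshot else theta
        theta = start - lr * grad(start, batch)
    return theta
```

On a stationary stream, continued adaptation converges toward the new task while k-shot adaptation stays one gradient step from the prior; on a non-stationary stream, that same drift is what causes the forgetting discussed in the experiments below.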

Crippling of End Effectors on Six-Legged Crawler

Hexapedal crawler (S ∈ R50, A ∈ R12)

Train: all models are trained on random joints being crippled (i.e., unable to apply actuator commands).

Test: the test tasks are out-of-distribution, as explained below.
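Crippling as described above can be pictured as zeroing the commanded torques of the affected joints. This is an assumed interpretation of "unable to apply actuator commands," not the paper's exact simulator code:

```python
import numpy as np

def cripple_joints(action, crippled):
    """Zero out the actuator commands of crippled joints.

    action   -- commanded action vector (12-dim for the hexapod)
    crippled -- indices of joints that cannot apply torque
    """
    action = np.asarray(action, dtype=float).copy()
    action[list(crippled)] = 0.0
    return action
```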


k-shot adaptation with meta-learning

Reward: -228

*** Unable to adapt enough during the crippled period to prevent rotation.


MOLe (ours)

Reward: 1095

***Note: The live bar graph here illustrates latent task variables (x-axis) and the probability (y-axis) of each. We see new tasks being inferred/added online, as well as past tasks being recalled.

Two illustrative test tasks:

(1) The agent sees a set configuration of crippled joints for the entire duration of its test-time experience. This is similar to data seen during training, and thus, we see that even model-based RL and model-based RL with online gradient updates do not fail. The methods that include both meta-learning and adaptation, however, do have higher performance. Furthermore, we see again that continued gradient steps are not detrimental in this single-task setting.

(2) The agent receives alternating periods of experience, between regions of normal operation and regions of having crippled legs. This non-stationary task distribution illustrates the need for online adaptation (model-based RL fails), the need for a good prior to adapt from (failures of model-based RL with online gradient updates), the harm of overfitting to recent experience and thus forgetting older skills (continued gradient steps has low performance), and the need for further adaptation away from the prior (k-shot adaptation shows limited performance).

MOLe is able to build its own representation of "task" switches, and we see that these switches do indeed correspond to the regions of leg crippling (timesteps 500-1000 and 1500-2000).

This figure shows the cumulative sum of rewards for trials where timesteps 500-1000 and 1500-2000 were periods of crippling two of the crawler's legs.

We see that k-shot adaptation does not improve when seeing a task again, continued gradient steps get worse after task switches, and MOLe is noticeably better as it sees the task more often.

Note that with MOLe, one skill does not explicitly hinder the other.

We ran another experiment by letting the crawler experience (during each trial) walking straight, making turns, and sometimes having a crippled leg. We compared the performance during the first 500 time steps of "walking forward in a normal configuration" to its last 500 time steps of "walking forward in a normal configuration."

While the beginning performance of continuous gradient steps was comparable to MOLe (average performance difference of +/-10%), its ending performance was 200% lower. Note the detrimental effect of updating knowledge without allowing for separate task specialization/adaptation.

Half-cheetah Motor Malfunctions

Half-cheetah agent (S ∈ R21, A ∈ R6)

Train: We train all models on data where an actuator is selected at random to experience a malfunction during the rollout. In this case, malfunction means that the polarity or magnitude of actions applied to that actuator are altered.

Test: The agent then experiences drastically out-of-distribution test tasks, such as altering all actuators at once or changing the malfunctions over time.
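The malfunction model described above can be pictured as a per-actuator transform on the commanded action. A minimal sketch, where the specific sign and scale values are illustrative rather than the paper's exact settings:

```python
import numpy as np

def apply_malfunction(action, signs, scales):
    """Per-actuator malfunction: flip polarity via `signs` (entries
    in {-1, +1}) and rescale magnitude via `scales`. A healthy
    actuator has sign=+1 and scale=1."""
    return np.asarray(action, dtype=float) * np.asarray(signs) * np.asarray(scales)
```

For the half-cheetah's 6-dim action space, a train-time malfunction would alter one entry of `signs` or `scales`, while the out-of-distribution test tasks (e.g., "all negative") alter all of them at once or change them over time.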


continued gradient steps

reward: -18


k-shot adaptation

reward: 100


MOLe (ours)

reward: 235

When the task distribution during the test trials contains only a single task, such as 'all negative', where all actuators are assigned the opposite polarity, then continuous gradient steps performs well by continuously performing gradient updates on incoming data.

However, as shown in the other tasks, the performance of continuous gradient steps substantially deteriorates when the agent experiences a non-stationary task distribution. Due to overspecialization on recent incoming data, such methods that continuously adapt tend to forget and lose previously existing skills.

This overfitting and forgetting of past skills is also illustrated by the consistent deterioration in the performance of continuous gradient steps in this plot of the cumulative sum of rewards (for a task where the malfunction changed every 500 timesteps).

MOLe, on the other hand, dynamically builds a probabilistic task distribution and allows adaptation to these difficult tasks, without forgetting past skills.

This experiment involved the agent experiencing alternating periods of normal and crippled-leg operation.

This plot shows the successful recognition of new tasks as well as old tasks; note that both the recognition and adaptation are all done online, using neither a bank of past data to perform the adaptation, nor a human-specified set of task categories.

Terrain Slopes on Half-cheetah

Half-cheetah agent (S ∈ R21, A ∈ R6)

This is the task of a half-cheetah agent, traversing terrains of differing slopes. The prior model is meta-trained on data from terrains with random slopes of low magnitudes, and the test trials are executed on difficult out-of-distribution tasks such as basins, steep hills, etc.

Our findings for this set of experiments show that separate task parameters aren't always necessary for what might externally seem like separate tasks.



Latent task distribution over time for half-cheetah landscape traversal. Interestingly, we find that MOLe chooses to use only one latent task variable to describe the varying terrain. This suggests that the meta-learning process finds a task space where skills transfer easily across slopes; thus, even when MOLe is faced with the option of switching tasks or adding new tasks to its dynamic latent task distribution, it chooses not to do so.

Results on half-cheetah landscape traversal. The model-based RL and model-based RL with online gradient updates methods indicate that a single model is not sufficient for effectively handling this task and that online learning is critical. Despite having similar training performance on the shallow training slopes, the two non-meta-learning baselines do indeed fail at these test tasks; this also shows that these changing-slope tasks are not particularly similar to one another (and that the discovered task space is perhaps useful).

The meta-learning approaches perform similarly; however, since these trials involve terrain and other physics changes that are extrapolated beyond the data seen previously, it is important to take multiple gradient steps during online learning.

Test-time performance vs data points used for meta-training

Performance on test tasks (i.e., unseen during training) of models that are meta-trained with differing amounts of data. Performance numbers here are normalized per agent, between 0 and 1.