Learning to Adapt in Dynamic, Real-World Environments through Meta-RL

Abstract

Although reinforcement learning methods can achieve impressive results in simulation, the real world presents two major challenges: generating samples is exceedingly expensive, and unexpected perturbations or unseen situations cause proficient but specialized policies to fail at test time. Given that it is impractical to train separate policies to accommodate all situations the agent may see in the real world, this work proposes to learn how to quickly and effectively adapt online to new tasks. To enable sample-efficient learning, we consider learning online adaptation in the context of model-based reinforcement learning. Our approach uses meta-learning to train a dynamics model prior such that, when combined with recent data, this prior can be rapidly adapted to the local context. Our experiments demonstrate online adaptation for continuous control tasks on both simulated and real-world agents. We first show simulated agents adapting their behavior online to novel terrains, crippled body parts, and highly-dynamic environments. We also illustrate the importance of incorporating online adaptation into autonomous agents that operate in the real world by applying our method to a real dynamic legged millirobot. We demonstrate the agent's learned ability to quickly adapt online to a missing leg, adjust to novel terrains and slopes, account for miscalibration or errors in pose estimation, and compensate for pulling payloads.
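To make the core idea concrete, below is a minimal sketch (ours, not the paper's released implementation) of the online adaptation step at the heart of a GrBAL-style learner: copy the meta-learned dynamics model prior, take one gradient step on the most recent handful of transitions, and hand the adapted model to a planner such as sampling-based MPC. The class and function names, network sizes, and learning rate are illustrative assumptions.

```python
# Minimal sketch of GrBAL-style online adaptation (illustrative, not the
# authors' code): adapt a meta-learned dynamics model to the most recent
# transitions with one gradient step, then plan with the adapted model.
import copy
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state from (state, action) as a residual update."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return state + self.net(torch.cat([state, action], dim=-1))

def adapt_online(meta_model, states, actions, next_states, adaptation_lr=0.01):
    """One inner-loop gradient step on the M most recent transitions.

    Returns a copy of the meta-learned prior whose weights are adapted to the
    local context (e.g., a crippled leg, a new terrain, or a payload).
    """
    adapted = copy.deepcopy(meta_model)
    loss = nn.functional.mse_loss(adapted(states, actions), next_states)
    grads = torch.autograd.grad(loss, list(adapted.parameters()))
    with torch.no_grad():
        for param, grad in zip(adapted.parameters(), grads):
            param -= adaptation_lr * grad
    return adapted  # use inside an MPC loop to select the next action
```

At test time this adaptation is repeated at every control step using only the most recent transitions, so the model can track changes in the dynamics (a lost leg, a new slope, a payload) as they happen.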

Real Robot Results

Performance (unseen tasks) visualized

The figure below illustrates that, when facing unseen tasks, the robot follows the desired path (dotted black line) most successfully using our approach. The four panels show the average and spread of paths followed over multiple trials for the tasks of (from left to right) 1) adapting to a missing front leg, 2) climbing a slick slope, 3) accounting for miscalibrated pose estimation, and 4) compensating for a payload. We observe that with GrBAL, the robot follows the path accurately and consistently across trials.


Performance (unseen tasks) quantified

This figure plots single-step return values, averaged over multiple trials, for each of the four tasks mentioned above. The tasks appear in a different order than in the previous figure, and the payload returns were computed on both a straight path and a right-turn path.

We observe that GrBAL achieves the highest return on all of these tasks.

Notably, the "loss of leg" task was particularly catastrophic for the general-purpose model-based (MB) baseline, whereas GrBAL learned a peculiar swiveling gait to overcome it.

Performance (training tasks) quantified

This table records single-step return values, averaged over multiple trials, for path-following on terrains which were in the training data set. For each of the three terrains (carpet, styrofoam, turf), we executed four paths: left turn, straight, zig-zag, and figure-eight.

We observe that GrBAL is comparable to MB across the board: our adaptation approach is not necessary for mastering the training tasks, but using it is also not detrimental.

Real Robot Video

Simulated Robot Results

Performance (fast adaptation and generalization)

In this series of simulated experiments, we demonstrate that our approaches are highly responsive to novel situations and adapt quickly. In particular, fast adaptation (F.A.) is required to achieve high return on the half-cheetah disabled joint (HC Dis. F.A.), ant crippled leg (Ant Crip. F.A.), and half-cheetah pier (HC Pier) tasks. For the first two tasks, we disable body parts in the middle of a rollout; in the third, the pier is composed of short segments with different buoyancies.

Further, we show the ability of our methods to generalize to unseen scenarios with the half-cheetah disabled joint (HC Dis. Gen.), ant crippled leg (Ant Crip. Gen.), and half-cheetah sloped terrain (HC Hill) tasks. This time, for the body-failure tasks, we cripple a particular leg/joint that was not crippled during training. In the sloped terrain task, the half-cheetah traverses slopes much steeper than those encountered during training.

For all tasks, our two approaches either dominate or are comparable to the baselines. In some scenarios, the normalized returns of all approaches (except perhaps TRPO) are similar, which may be because the task is not challenging enough to distinguish the methods. As a specific example, the crippled-ant generalization task (Ant Crip. Gen.) yields nearly equivalent returns across methods, but note that the ant remains a fairly stable structure even without one leg.

Note: the returns are normalized so that the MB oracle achieves a return of 1.

Sample Efficiency

We trained a state-of-the-art model-free meta-RL method (MAML-RL) and a model-free method (TRPO) until convergence, using the equivalent of about two days of real-world experience. (Note: the dotted line indicates average return at convergence.) We then trained GrBAL and ReBAL on 1/1000 of this quantity of data and plotted the average returns.

We observe that our approaches are superior or equivalent to the model-free agent trained with 1000x more data. They do fall short of the asymptotic performance of the model-free meta-RL method; however, in many domains, it is worth sacrificing some asymptotic performance for this degree of sample efficiency.

Simulated Robot Videos

Half-cheetah Pier

Each block moves up and down when stepped on, and because each block has different damping and friction properties, the dynamics change rapidly as the half-cheetah crosses the pier. The half-cheetah is meta-trained with varying block properties and tested on a specific (randomly selected) configuration of block properties.
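A rough sketch of how such a task distribution might be generated is shown below; the parameter names and ranges are our own placeholders, not values from the paper.

```python
import numpy as np

def sample_pier_task(num_blocks=10, rng=None):
    """Sample one pier configuration: per-block damping and friction.

    Placeholder ranges; the paper varies block properties during
    meta-training and evaluates on one held-out random configuration.
    """
    rng = rng if rng is not None else np.random.default_rng()
    return {
        "damping": rng.uniform(1.0, 10.0, size=num_blocks),
        "friction": rng.uniform(0.1, 1.0, size=num_blocks),
    }
```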

Ant Cripple Leg

For each meta-training rollout, we randomly sample a leg of the quadrupedal robot and disable it. Unexpectedly disabling a leg drastically changes the dynamics. We evaluate performance when crippling a leg from outside the training distribution.
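A sketch of this rollout setup follows, assuming a hypothetical mapping from legs to actuator indices (the real layout depends on the simulator's action space and is not taken from the paper's code).

```python
import numpy as np

# Hypothetical mapping from legs to actuator indices (illustrative only).
LEG_ACTUATORS = {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}

def sample_crippled_leg(train_legs=(0, 1, 2), rng=None):
    """Pick a leg to disable for one meta-training rollout; the held-out
    leg (here leg 3) is crippled only at evaluation time."""
    rng = rng if rng is not None else np.random.default_rng()
    return int(rng.choice(train_legs))

def mask_action(action, crippled_leg):
    """Zero the torques of the crippled leg before stepping the simulator."""
    action = np.array(action, dtype=float)
    action[LEG_ACTUATORS[crippled_leg]] = 0.0
    return action
```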

Half-cheetah Sloped Terrain

During meta-training, we sample terrains with varying gentle upward and downward slopes.

In this task, it is especially important to incorporate past experience into the model, since the cheetah has no means of directly observing the incline. We evaluate performance on a hill that goes up and down.
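As a rough sketch (the bound below is a placeholder, not the paper's value), the task distribution amounts to drawing an incline per rollout while keeping it hidden from the observation:

```python
import numpy as np

def sample_training_incline(max_incline_deg=5.0, rng=None):
    """Sample a gentle up/down incline (in degrees) for one meta-training
    rollout. The incline shapes the terrain only; it is deliberately not
    appended to the observation, so the dynamics model must infer it from
    the most recent transitions."""
    rng = rng if rng is not None else np.random.default_rng()
    return rng.uniform(-max_incline_deg, max_incline_deg)
```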