MBRL-Game

A Game Theoretic Framework for
Model-Based Reinforcement Learning

Aravind Rajeswaran, Igor Mordatch, Vikash Kumar

University of Washington and Google Research, Brain Team

PAPER, CODE

International Conference on Machine Learning (ICML) 2020

Motivation

Model-Based RL (MBRL) has received considerable interest due to its potential for sample-efficient learning. While recent works have proposed new algorithms and heuristics, an algorithmic framework that can capture the practical challenges and unify the core insights from prior work has been lacking. As a result, designing stable and efficient MBRL algorithms with rich function approximators has remained challenging.

Contributions

  1. We present a framework that casts MBRL as a two player game.

  2. We develop two families of algorithms to solve this game: PAL and MAL, which have complementary strengths. Together, they encapsulate, unify, and generalize a large collection of existing MBRL algorithms.

  3. Practical implementations of PAL and MAL lead to state-of-the-art (SOTA) results on tasks from OpenAI gym, ROBEL, and a dexterous hand manipulation suite.

MBRL as a two player game

  • Policy player maximizes rewards in the model (M)

  • Model player minimizes prediction error under the state-action distribution that the policy induces in the world (W)

  • At Nash equilibrium, (1) the model predicts the policy's true performance accurately; (2) the policy is near-optimal.

A game formulation cleanly separates MBRL into policy optimization and model learning. It makes clear that the two components must work together to reach an equilibrium, and that naive independent learning is unlikely to succeed.
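To make the objectives concrete, let J(π, M) denote the expected return of policy π under model M, and let ℓ(M, π) denote the prediction error of M on the state-action distribution that π induces in the world W. With this illustrative notation (which need not match the paper's exactly), the game can be sketched as

$$
\text{policy player:} \quad \max_{\pi} \; J(\pi, M), \qquad
\text{model player:} \quad \min_{M} \; \ell(M, \pi), \quad
\ell(M, \pi) = \mathbb{E}_{(s,a) \sim d^{W}_{\pi}} \big[ \, \| M(s,a) - W(s,a) \|^{2} \, \big],
$$

where $d^{W}_{\pi}$ is the state-action visitation distribution of $\pi$ in the world, and the squared-error loss is just one possible choice (a likelihood or KL-based loss fits the same template).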

Algorithms

Developing algorithms for general continuous games is known to be challenging. Direct extensions of workhorses in learning (e.g., SGD) to game settings are known to suffer from the non-stationarities introduced by the game. To design stable gradient-based algorithms, we instead consider the Stackelberg formulation of the MBRL game. Stackelberg games are asymmetric games where we pre-specify the order in which players update their parameters. They permit stable gradient-based algorithms through approximate bi-level optimization. Similar ideas have recently been used to study meta-learning, GANs, human-robot interaction, and primal-dual RL. The key to solving Stackelberg games is to make one player (the follower) learn very quickly, while the other player (the leader) learns slowly. Two algorithm families emerge from this viewpoint.
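As a sketch in generic leader/follower variables $x$ and $y$ (not the paper's notation), a Stackelberg game corresponds to the bi-level program

$$
\min_{x} \; f\big(x, \, y^{*}(x)\big) \quad \text{subject to} \quad y^{*}(x) \in \arg\min_{y} \; g(x, y),
$$

where the leader $x$ accounts for the follower's best response $y^{*}(x)$; in practice the best response is tracked approximately by letting the follower update much faster than the leader.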

Policy as Leader (PAL)

PAL learns the policy at the outer level (slow learning) and the model at the inner level (fast learning). The model is thus an implicit function of the policy, reducing the game to a bi-level optimization problem.
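In the illustrative notation introduced above, the PAL formulation can be sketched as the bi-level problem

$$
\max_{\pi} \; J\big(\pi, \, M^{*}(\pi)\big) \quad \text{subject to} \quad M^{*}(\pi) \in \arg\min_{M} \; \ell(M, \pi).
$$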


In practice, we use the first-order gradient approximation:
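One plausible instantiation of this approximation (a sketch; the iteration index $k$ and step size $\alpha$ are illustrative notation) solves the inner problem approximately on recent data and drops the implicit dependence of $M^{*}$ on $\pi$ when taking the outer gradient:

$$
M_{k+1} \approx \arg\min_{M} \; \ell(M, \pi_{k}), \qquad
\pi_{k+1} = \pi_{k} + \alpha \, \nabla_{\pi} J(\pi, M_{k+1}) \big|_{\pi = \pi_{k}}.
$$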

Model as Leader (MAL)

MAL learns the model at the outer level (slow learning) and the policy at the inner level (fast learning). The policy is thus an implicit function of the model, reducing the game to a bi-level optimization problem.
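Mirroring PAL, the MAL formulation can be sketched as

$$
\min_{M} \; \ell\big(M, \, \pi^{*}(M)\big) \quad \text{subject to} \quad \pi^{*}(M) \in \arg\max_{\pi} \; J(\pi, M).
$$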


In practice, we use the first-order gradient approximation:
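One plausible instantiation (again a sketch with illustrative notation; $\beta$ is a step size) optimizes the policy approximately to convergence in the current model and takes a small, conservative model step on the aggregated data, ignoring the implicit dependence of $\pi^{*}$ on $M$:

$$
\pi_{k+1} \approx \arg\max_{\pi} \; J(\pi, M_{k}), \qquad
M_{k+1} = M_{k} - \beta \, \nabla_{M} \ell(M, \pi_{k+1}) \big|_{M = M_{k}}.
$$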

Sample Efficiency

Practical versions of the above templates (with model-based natural policy gradient) lead to sample-efficient learning. Both algorithms (PAL and MAL) can solve tasks from the ROBEL suite in under 1 hour and a dexterous in-hand manipulation task in about 3-4 hours. These results are 6X more sample efficient than SOTA model-free algorithms like soft actor-critic (SAC).

Turn D'Claw to random target angles

MBRL-GAME-dclawTurnRandom.mp4

Orient D'Kitty to random targets

MBRL-GAME-dkittyOrientRandom.mp4

Orient pen to random targets

MBRL-GAME-penOrientRandom.mp4

On gym benchmarks, our algorithms outperform all prior work. Our results suggest that PAL-NPG and MAL-NPG: (a) are substantially more sample efficient than prior model-based and model-free methods; (b) achieve the asymptotic performance of their model-free counterparts; (c) scale gracefully to high-dimensional tasks with complex dynamics, like dexterous manipulation; (d) scale to tasks requiring extended rollouts, like those in OpenAI gym.


Choosing between PAL and MAL

To illustrate the relative strengths of PAL and MAL, we study their learning performance in two non-stationary environments.

Non-stationary dynamics

We consider a task of reaching spatial goals with a 7-DOF robot arm. Midway through learning, we introduce a dynamics perturbation by changing the length of the elbow from scenario 1 to scenario 2, as illustrated below. At the point of the perturbation, all algorithms suffer a performance degradation. Since PAL utilizes only recent data, it quickly adapts to the dynamics change and enables the policy to recover. In contrast, MAL adapts the model conservatively and does not forget old, now-inconsistent data, thereby biasing and slowing down policy learning.

Non-stationary goals

We now consider a perturbation to the goal distribution (without any dynamics change). Midway through training, the goal distribution is changed as shown below. Note that the policy does not generalize zero-shot to the new goal distribution and requires additional learning or fine-tuning. Since MAL learns a more broadly accurate model, it can solve the new goal distribution very quickly. In contrast, PAL consumes more data, since it needs to build local models around all the intermediate policies encountered during the course of learning.


Thus, in summary, we find that PAL is better suited for situations where the dynamics of the world can drift over time. In contrast, MAL is better suited for situations where the task or goal distribution can change over time, and related settings like multi-task learning.

Summary

  • We formulate MBRL as a two-player game between: (1) a policy player which aims to maximize rewards under the learned model; (2) a model player which aims to fit the data collected by the policy player.

  • We cast the MBRL game in its Stackelberg form, and solve it with approximate bi-level optimization. The Stackelberg game, being asymmetric, can take two forms based on which player is chosen as the leader. This gives rise to two natural algorithm families: PAL and MAL.

  • Together, PAL and MAL encapsulate, unify, and generalize a large collection of existing MBRL algorithms.

  • Through experiments on a suite of continuous control tasks, we verify that practical versions of PAL and MAL lead to sample-efficient learning.

Detailed overview
(with a sample MAL and PAL algorithm)
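For readers who prefer code, below is a minimal Python sketch of the two templates as described on this page. The helper callables (collect_rollouts, fit_model, policy_step, policy_optimize, fit_model_conservative) are hypothetical placeholders for data collection, model fitting, and (natural) policy gradient updates; this is an illustration of the structure, not the released implementation.

```python
# Hypothetical sketch of the PAL and MAL templates described on this page.
# The callables passed in are illustrative placeholders, not functions from
# the released codebase.

from typing import Any, Callable, List, Tuple


def pal_template(
    policy: Any,
    model: Any,
    collect_rollouts: Callable[[Any], List[Any]],   # run the policy in the world, return trajectories
    fit_model: Callable[[Any, List[Any]], Any],      # aggressively fit the model to (recent) data
    policy_step: Callable[[Any, Any], Any],          # one conservative (e.g. NPG) policy update in the model
    iterations: int,
) -> Tuple[Any, Any]:
    """Policy as Leader: fast inner model fitting, slow outer policy improvement."""
    for _ in range(iterations):
        recent_data = collect_rollouts(policy)       # data from the current policy only
        model = fit_model(model, recent_data)        # inner level (fast): fit model on recent data
        policy = policy_step(policy, model)          # outer level (slow): small policy improvement
    return policy, model


def mal_template(
    policy: Any,
    model: Any,
    collect_rollouts: Callable[[Any], List[Any]],
    fit_model_conservative: Callable[[Any, List[Any]], Any],  # small, conservative model update
    policy_optimize: Callable[[Any, Any], Any],               # optimize the policy (near-)fully in the model
    iterations: int,
) -> Tuple[Any, Any]:
    """Model as Leader: fast inner policy optimization, slow outer model update."""
    buffer: List[Any] = []
    for _ in range(iterations):
        policy = policy_optimize(policy, model)               # inner level (fast): optimize policy in the model
        buffer += collect_rollouts(policy)                    # aggregate data from the improved policy
        model = fit_model_conservative(model, buffer)         # outer level (slow): conservative fit on all data
    return policy, model
```

The only structural differences between the two templates are which player best-responds quickly (the inner/follower level), which player takes small conservative steps (the outer/leader level), and whether the model is fit to recent data (PAL) or to all data aggregated so far (MAL).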
