Trajectory-wise Multiple Choice Learning
for Dynamics Generalization in Reinforcement Learning

Younggyo Seo*, Kimin Lee*, Ignasi Clavera, Thanard Kurutach, Jinwoo Shin, Pieter Abbeel

*Equal contribution

KAIST, UC Berkeley

[Github Code] [Paper]

Abstract

Model-based reinforcement learning (RL) has shown great potential in various control tasks in terms of both sample-efficiency and final performance. However, learning a generalizable dynamics model that is robust to changes in dynamics remains a challenge, since the target transition dynamics follow a multi-modal distribution. In this paper, we present a new model-based RL algorithm, coined trajectory-wise multiple choice learning, that learns a multi-headed dynamics model for dynamics generalization. The main idea is to update only the most accurate prediction head, so that each head specializes in environments with similar dynamics, i.e., clusters environments. Moreover, we incorporate context learning, which encodes dynamics-specific information from past experiences into a context latent vector, enabling the model to perform online adaptation to unseen environments. Finally, to utilize the specialized prediction heads more effectively, we propose an adaptive planning method, which selects the most accurate prediction head over a recent experience. Our method exhibits superior zero-shot generalization performance across a variety of control tasks, compared to state-of-the-art RL methods.

Multi-modal distribution of transition dynamics

As a motivating example of the dynamics generalization problem in model-based RL, we visualize the next states obtained by crippling one of the legs of an ant robot [Figure (a)]. Figure (b) shows that the target transition dynamics follow a multi-modal distribution, where each mode corresponds to a different crippled leg, even though the original environment has deterministic transition dynamics. This implies that a model-based RL algorithm capable of approximating this multi-modal distribution is required to develop a reliable agent that is robust to changes in dynamics.

(a) Ant with crippled legs

(b) Multi-modality in transition dynamics

Method

In our paper, we propose trajectory-wise multiple choice learning (T-MCL), which learns a multi-headed dynamics model for dynamics generalization. We present a trajectory-wise oracle loss that makes each prediction head specialize in different environments, and then introduce a context-conditional prediction head to further improve generalization. Finally, we propose an adaptive planning method that generates actions by planning under the prediction head that is most accurate over a recent experience (a code sketch follows the figure panels below).

(a) Multi-headed dynamics model

(b) Multiple choice learning

(c) Adaptive planning
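Below is a minimal PyTorch sketch of the two core ingredients: the trajectory-wise oracle loss, which updates only the head with the lowest error over a whole trajectory, and the adaptive selection of a head for planning based on recent experience. This is not the authors' implementation; the module names, network sizes, and the omission of the context encoder and the reward head are simplifying assumptions.

```python
import torch
import torch.nn as nn


class MultiHeadedDynamicsModel(nn.Module):
    """Shared backbone with multiple prediction heads (context encoder omitted)."""

    def __init__(self, state_dim, action_dim, num_heads=4, hidden_dim=200):
        super().__init__()
        # Shared backbone that encodes (state, action) pairs.
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Each head predicts the next state; heads are encouraged to
        # specialize in environments with similar dynamics.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, state_dim) for _ in range(num_heads)]
        )

    def forward(self, states, actions):
        # states: (B, T, state_dim), actions: (B, T, action_dim)
        features = self.backbone(torch.cat([states, actions], dim=-1))
        # Returns predictions of shape (num_heads, B, T, state_dim).
        return torch.stack([head(features) for head in self.heads], dim=0)


def trajectory_wise_oracle_loss(model, states, actions, next_states):
    """Backpropagate only through the head that is most accurate over the
    entire trajectory, so heads cluster environments rather than transitions."""
    preds = model(states, actions)                      # (H, B, T, S)
    errors = (preds - next_states.unsqueeze(0)) ** 2    # squared prediction error
    per_traj_error = errors.mean(dim=(2, 3))            # (H, B): averaged over time and state dims
    best_head = per_traj_error.argmin(dim=0)            # (B,): oracle head per trajectory
    return per_traj_error.gather(0, best_head.unsqueeze(0)).mean()


@torch.no_grad()
def select_head_for_planning(model, recent_states, recent_actions, recent_next_states):
    """Adaptive planning: pick the head with the lowest error on recent experience."""
    preds = model(recent_states, recent_actions)
    per_head_error = ((preds - recent_next_states.unsqueeze(0)) ** 2).mean(dim=(1, 2, 3))
    return per_head_error.argmin().item()
```

In the full method, the heads are additionally conditioned on a context latent vector encoding past transitions, and the selected head is used inside the model-predictive control planner to generate actions.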

Experimental results

T-MCL outperforms existing model-based RL baselines and, in some tasks, even outperforms PEARL, a model-free meta-RL method, in terms of sample efficiency.

(a) Hopper

(b) HalfCheetah

(c) CrippledAnt

Analysis on specialization

We visualize the fraction of training trajectories assigned to each prediction head optimized by (a) MCL and (b) T-MCL on Hopper environments. The number in the (i, j)-th cell of each table denotes the fraction of trajectories with the i-th environment parameter (i.e., mass) assigned to the j-th head of the multi-headed dynamics model. (c) Generalization performance of dynamics models trained with MCL and T-MCL on unseen Hopper environments.

(a) Multiple-choice learning (MCL)

(b) Trajectory-MCL (T-MCL)

(c) Generalization performance
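As a small illustration (assumed, not taken from the paper's code), the assignment tables in panels (a) and (b) can be computed by recording, for each training trajectory, which head achieved the lowest trajectory-wise error, and then normalizing the counts per environment parameter:

```python
import numpy as np

def assignment_fractions(env_param_ids, best_heads, num_params, num_heads):
    """env_param_ids[k] is the environment-parameter index (e.g., mass index)
    of trajectory k; best_heads[k] is the head with the lowest error on it."""
    counts = np.zeros((num_params, num_heads))
    for p, h in zip(env_param_ids, best_heads):
        counts[p, h] += 1
    # The (i, j)-th entry is the fraction of trajectories with the i-th
    # environment parameter assigned to the j-th head.
    return counts / counts.sum(axis=1, keepdims=True)
```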

Qualitative analysis

We first train T-MCL on CartPole environments with different masses and visualize the agents' behavior by manually assigning specialized prediction heads. One can observe that the agents act as if they were lightweight or heavyweight when using prediction heads specialized for different environments.

(a-1) CartPole (mass=0.25) with prediction heads specialized for mass=0.25

(a-2) CartPole (mass=1.0) with prediction heads specialized for mass=0.25
The agent acts as if it were lightweight in order to move the pole to the upright position.

(b-1) CartPole (mass=2.5) with prediction heads specialized for mass=2.5

(b-2) CartPole (mass=1.0) with prediction heads specialized for mass=2.5
The agent acts as if it were heavyweight in order to move the pole to the upright position.

Bibtex

@inproceedings{seo2020trajectory,
  title={Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning},
  author={Seo, Younggyo and Lee, Kimin and Clavera, Ignasi and Kurutach, Thanard and Shin, Jinwoo and Abbeel, Pieter},
  booktitle={Advances in Neural Information Processing Systems},
  year={2020}
}