Hypernetworks for Zero-shot Transfer in Reinforcement Learning

Sahand Rezaei-Shoshtari1,2,3, Charlotte Morissette1,3, Francois R. Hogan3, Gregory Dudek1,2,3, David Meger1,2,3

1 McGill University, 2 Mila - Québec AI Institute, 3 Samsung AI Center Montreal


In the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023).

TL;DR: We use hypernetworks to approximate RL solutions as a mapping from a family of MDPs to a family of near-optimal policies.

Abstract

In this paper, hypernetworks are trained to generate behaviors across a range of unseen task conditions, via a novel TD-based training objective and data from a set of near-optimal RL solutions for training tasks. This work relates to meta RL, contextual RL, and transfer learning, with a particular focus on zero-shot performance at test time, enabled by knowledge of the task parameters (also known as context). Our technical approach is based upon viewing each RL algorithm as a mapping from the MDP specifics to the near-optimal value function and policy, and we seek to approximate it with a hypernetwork that can generate near-optimal value functions and policies given the parameters of the MDP. We show that, under certain conditions, this mapping can be considered as a supervised learning problem. We empirically evaluate the effectiveness of our method for zero-shot transfer to new reward and transition dynamics on a series of continuous control tasks from DeepMind Control Suite. Our method demonstrates significant improvements over baselines from multitask and meta RL approaches.

Demo


Introduction and Motivation

  • Transfer learning aims to generalize parameters from a trained policy to unseen tasks.

  • We study zero-shot transfer to new reward and dynamics settings for contextual MDPs with a known context parameter.

Assumptions and Problem Formulation

  • We approximate RL solutions as a mapping from a family of parameterized MDPs to a family of near-optimal solutions under certain assumptions:
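As a concrete illustration, the setting can be written as a family of MDPs that share state and action spaces but differ in their reward and transition parameters, indexed by a known context. The notation below is chosen for this page and may differ from the paper.

```latex
% Notation chosen for this page; symbols may differ from the paper.
% A family of MDPs sharing state and action spaces, indexed by a known
% context \psi that parameterizes the reward and the transition dynamics:
\[
    \mathcal{M}_{\psi} = \big(\mathcal{S}, \mathcal{A}, P_{\psi}, R_{\psi}, \gamma\big),
    \qquad \psi \in \Psi .
\]
% HyperZero approximates the mapping from the context to the near-optimal solution,
\[
    h_{\theta} : \psi \;\longmapsto\; \big(\pi^{*}_{\psi},\, Q^{*}_{\psi}\big),
\]
% using data from near-optimal RL solutions obtained on a set of training contexts.
```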

HyperZero

  • Learning the mapping can be framed as a supervised learning problem under certain assumptions.

  • Inputs to the hypernetwork are the MDP context, including reward and dynamics parameters.

  • Outputs of the hypernetwork are the weights of the approximated near-optimal policy and value function.

  • The loss is the error in predicting the near-optimal actions and values with the generated main networks (a minimal sketch follows this list).
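The following is a minimal, hypothetical PyTorch sketch of this setup: a hypernetwork maps the task context to the flat weight vector of a generated policy and critic, and a supervised loss regresses their outputs onto near-optimal actions and values collected from trained RL agents. Class names, batch keys (`expert_act`, `expert_val`), and sizes are illustrative choices for this page, not the authors' implementation.

```python
import torch
import torch.nn as nn


def mlp_param_count(sizes):
    """Number of parameters in a fully connected MLP with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


def run_generated_mlp(flat_params, x, sizes):
    """Evaluate an MLP whose weights are sliced out of a flat parameter vector."""
    idx, h = 0, x
    n_layers = len(sizes) - 1
    for layer, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        W = flat_params[idx:idx + n_in * n_out].view(n_out, n_in)
        idx += n_in * n_out
        b = flat_params[idx:idx + n_out]
        idx += n_out
        h = h @ W.t() + b
        if layer < n_layers - 1:
            h = torch.relu(h)
    return h


class HyperZeroSketch(nn.Module):
    """Hypernetwork: task context -> weights of a generated policy and critic.

    Processes one task (one 1-D context vector) at a time for clarity.
    """

    def __init__(self, context_dim, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.pi_sizes = [obs_dim, hidden, act_dim]       # generated policy MLP
        self.q_sizes = [obs_dim + act_dim, hidden, 1]    # generated critic MLP
        n_out = mlp_param_count(self.pi_sizes) + mlp_param_count(self.q_sizes)
        self.hyper = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_out),
        )

    def forward(self, context, obs, act):
        params = self.hyper(context)                     # flat weight vector
        n_pi = mlp_param_count(self.pi_sizes)
        action = run_generated_mlp(params[:n_pi], obs, self.pi_sizes)
        value = run_generated_mlp(params[n_pi:],
                                  torch.cat([obs, act], dim=-1), self.q_sizes)
        return action, value


def supervised_loss(model, batch):
    """Regress generated policy/critic outputs onto near-optimal actions and values."""
    pred_act, pred_val = model(batch["context"], batch["obs"], batch["expert_act"])
    action_loss = ((pred_act - batch["expert_act"]) ** 2).mean()
    # batch["expert_val"] is assumed to have shape (T,).
    value_loss = ((pred_val.squeeze(-1) - batch["expert_val"]) ** 2).mean()
    return action_loss + value_loss
```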

Temporal Difference Regularization

  • We regularize the approximated critic with a TD loss that moves the predicted target value towards the current value estimate obtained from the near-optimal RL solution (see the sketch below).
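One plausible way to instantiate this regularizer, reusing the hypothetical sketch above: a TD target built with the generated critic at the next state-action pair is regressed towards the current value estimate taken from the near-optimal RL solution. The exact form of the target and its weighting in the paper may differ; the batch keys are again illustrative.

```python
def td_regularizer(model, batch, gamma=0.99):
    """TD regularization of the generated critic (one plausible instantiation)."""
    # Generated critic's value prediction at the next state-action pair.
    _, pred_next_val = model(batch["context"], batch["next_obs"],
                             batch["expert_next_act"])
    # TD target built with the generated critic...
    td_target = batch["reward"] + gamma * pred_next_val.squeeze(-1)
    # ...moved towards the current value estimate from the near-optimal RL solution.
    return ((td_target - batch["expert_val"]) ** 2).mean()


# Total objective: supervised regression plus the TD regularizer (weight is a guess).
# loss = supervised_loss(model, batch) + td_weight * td_regularizer(model, batch)
```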

Experimental Results

  • Results are obtained on the Contextual Control Suite, built on top of the DeepMind Control Suite.

  • Reward parameters: desired speed (positive and negative values).

  • Dynamics parameters: body size, weight and inertia.

  • Experiments show strong zero-shot behavior of HyperZero, which achieves nearly the full performance of an RL learner trained directly on the target task (see the evaluation sketch after this list).

  • Zero-shot transfer to new rewards

  • Zero-shot transfer to new dynamics

  • Zero-shot transfer to new rewards and dynamics
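
For completeness, a hedged sketch of what zero-shot evaluation looks like with the model sketched earlier: the unseen task context is fed to the hypernetwork, the generated policy is rolled out directly, and no gradient updates are performed. The environment constructor, the gym-style 4-tuple step API, and the context layout are all hypothetical.

```python
import torch


def zero_shot_rollout(model, context, env, horizon=1000):
    """Roll out the policy generated for an unseen context, with no further training."""
    obs = env.reset()
    total_reward = 0.0
    dummy_act = torch.zeros(model.pi_sizes[-1])  # critic input is unused here
    for _ in range(horizon):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():
            action, _ = model(context, obs_t, dummy_act)
        obs, reward, done, _ = env.step(action.numpy())
        total_reward += reward
        if done:
            break
    return total_reward


# Example with a hypothetical context layout and environment constructor:
# context = torch.tensor([2.5, 1.2])  # e.g., [desired_speed, body_size_scale]
# env = make_contextual_env(desired_speed=2.5, body_size_scale=1.2)  # hypothetical
# print(zero_shot_rollout(model, context, env))
```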