Hypernetworks for Zero-shot Transfer in Reinforcement Learning
Sahand Rezaei-Shoshtari1,2,3, Charlotte Morissette1,3, Francois R. Hogan3, Gregory Dudek1,2,3, David Meger1,2,3
1 McGill University, 2 Mila - Québec AI Institute, 3 Samsung AI Center Montreal
In the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023).
TL;DR: We use hypernetworks to approximate RL solutions as a mapping from a family of MDPs to a family of near-optimal policies.
Abstract
In this paper, hypernetworks are trained to generate behaviors across a range of unseen task conditions, via a novel TD-based training objective and data from a set of near-optimal RL solutions for training tasks. This work relates to meta RL, contextual RL, and transfer learning, with a particular focus on zero-shot performance at test time, enabled by knowledge of the task parameters (also known as context). Our technical approach views each RL algorithm as a mapping from the MDP specifics to the near-optimal value function and policy, and seeks to approximate it with a hypernetwork that can generate near-optimal value functions and policies given the parameters of the MDP. We show that, under certain conditions, learning this mapping can be framed as a supervised learning problem. We empirically evaluate the effectiveness of our method for zero-shot transfer to new reward and transition dynamics on a series of continuous control tasks from the DeepMind Control Suite. Our method demonstrates significant improvements over baselines from multitask and meta RL approaches.
Demo
Introduction and Motivation
Transfer learning aims to generalize parameters from a trained policy to unseen tasks.
We study zero-shot transfer to new reward and dynamics settings for contextual MDPs with a known context parameter.
Assumptions and Problem Formulation
We approximate RL solutions as a mapping from a family of parameterized MDPs to a family of near-optimal solutions under certain assumptions; a rough formalization of this mapping is sketched below.
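As a rough formalization (the notation here is ours, not taken verbatim from the paper), each task is an MDP indexed by a context vector ψ that parameterizes its reward and/or transition dynamics, and we approximate the mapping from ψ to the corresponding near-optimal solution:

    \mathcal{M}_\psi = (\mathcal{S}, \mathcal{A}, P_\psi, R_\psi, \gamma), \qquad \Phi : \psi \mapsto (\pi^*_\psi, Q^*_\psi)

where Φ denotes the mapping realized by running an RL algorithm to (near-)convergence on M_ψ, and the hypernetwork is trained to approximate Φ from the near-optimal solutions collected on the training contexts.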
HyperZero
Learning the mapping can be framed as a supervised learning problem under certain assumptions:
Inputs to the hypernetwork are the MDP context, including reward and dynamics parameters.
Outputs of the hypernetwork are the weights of the approximated near-optimal policy and value function.
The loss is defined as the error in predicting near-optimal actions and values with the generated main networks (a minimal training sketch follows this list).
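Below is a minimal sketch of this supervised setup, assuming a PyTorch implementation with an illustrative single-hidden-layer policy; all names, sizes, and the omission of the value-function head are our assumptions, not the authors' code.

    # Minimal sketch (not the authors' code): a hypernetwork maps an MDP context
    # vector to the weights of a small policy MLP, trained by regressing onto
    # near-optimal actions collected from RL solutions of the training tasks.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    STATE_DIM, ACTION_DIM, CONTEXT_DIM, HIDDEN = 8, 2, 3, 64  # illustrative sizes

    class PolicyHyperNetwork(nn.Module):
        def __init__(self):
            super().__init__()
            # Parameter counts of the generated policy MLP: state -> hidden -> action.
            self.n_w1 = STATE_DIM * HIDDEN
            self.n_b1 = HIDDEN
            self.n_w2 = HIDDEN * ACTION_DIM
            self.n_b2 = ACTION_DIM
            n_params = self.n_w1 + self.n_b1 + self.n_w2 + self.n_b2
            # The hypernetwork itself: context -> flat parameter vector.
            self.net = nn.Sequential(
                nn.Linear(CONTEXT_DIM, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, n_params),
            )

        def forward(self, context, state):
            """Generate policy weights from the context, then run that policy on state."""
            params = self.net(context)
            w1, b1, w2, b2 = torch.split(
                params, [self.n_w1, self.n_b1, self.n_w2, self.n_b2])
            h = F.relu(state @ w1.view(STATE_DIM, HIDDEN) + b1)
            return torch.tanh(h @ w2.view(HIDDEN, ACTION_DIM) + b2)

    hyper = PolicyHyperNetwork()
    opt = torch.optim.Adam(hyper.parameters(), lr=3e-4)

    # One (stand-in) training step: regress generated-policy actions onto the
    # near-optimal actions a* stored for a training context.
    context = torch.randn(CONTEXT_DIM)
    states = torch.randn(128, STATE_DIM)
    expert_actions = torch.rand(128, ACTION_DIM) * 2 - 1  # placeholder for a* from the RL solution

    pred_actions = hyper(context, states)
    loss = F.mse_loss(pred_actions, expert_actions)
    opt.zero_grad(); loss.backward(); opt.step()

The value function can be generated analogously, with a second parameter head regressed onto the near-optimal values recorded from the RL solutions.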
Temporal Difference Regularization
We use a TD loss to regularize the approximated critic by moving the predicted TD target value towards the current value estimate obtained from the RL solution.
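A minimal sketch of this regularizer, under our reading of the description above (the function name and the MSE form are assumptions):

    # Sketch of the TD regularizer: the TD target built from the approximated
    # critic V_hat is pulled towards the value estimate V* recorded from the
    # near-optimal RL solution of the training task.
    import torch
    import torch.nn.functional as F

    def td_regularizer(v_hat_next, reward, v_star, gamma=0.99):
        """v_hat_next: V_hat(s') from the hypernetwork-generated critic,
        reward: r along the near-optimal trajectory,
        v_star: V*(s) stored from the RL solution."""
        td_target = reward + gamma * v_hat_next
        return F.mse_loss(td_target, v_star)

    # Example usage with stand-in tensors:
    reward, v_hat_next, v_star = torch.randn(128), torch.randn(128), torch.randn(128)
    loss_td = td_regularizer(v_hat_next, reward, v_star)
    # total_loss = loss_regression + lambda_td * loss_td  # lambda_td is a weighting assumption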
Experimental Results
Results are obtained on our Contextual Control Suite, built on top of the DeepMind Control Suite.
Reward parameters: desired speed (positive and negative values).
Dynamics parameters: body size, weight, and inertia (an illustrative context vector is sketched below).
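Purely as an illustration of what a task context might look like (values, ordering, and scaling are hypothetical, not the benchmark's actual interface):

    import numpy as np

    # Hypothetical context for one task instance of the Contextual Control Suite;
    # every entry below is an illustrative assumption.
    context = np.array([
        1.5,   # desired speed (reward parameter; negative values request backward motion)
        0.8,   # body size scale (dynamics parameter)
        1.2,   # weight scale (dynamics parameter)
        1.0,   # inertia scale (dynamics parameter)
    ], dtype=np.float32)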
Experiments show strong zero-shot behavior of HyperZero, achieving nearly the full performance of an RL agent trained directly on the target task.
Zero-shot transfer to new rewards
Zero-shot transfer to new dynamics
Zero-shot transfer to new rewards and dynamics