EUCLID: Towards Efficient Unsupervised Reinforcement Learning with Multi-choice Dynamics Model
Overview figure: task-agnostic environment (pre-training) → various downstream tasks (fine-tuning)
At every timestep in the PT phase, the agent can only interact with the task-agnostic, reward-free environment and obtains intrinsic rewards learned in a self-supervised manner. In contrast, in the FT phase, the agent needs to adapt quickly to downstream tasks using the task-specific extrinsic rewards provided by the environment.
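The two-phase protocol above can be summarized in code. Below is a minimal sketch assuming a generic gym-style environment and an agent object with hypothetical `act`, `compute_intrinsic_reward`, and `update` methods; these names are illustrative placeholders, not EUCLID's actual API.

```python
# Minimal sketch of the PT/FT protocol (hypothetical names, not EUCLID's actual API).

def pretrain(agent, env, num_steps):
    """Pre-training: the environment is reward-free; the agent labels each
    transition with a self-supervised intrinsic reward."""
    obs = env.reset()
    for _ in range(num_steps):
        action = agent.act(obs)
        next_obs, _, done, _ = env.step(action)            # extrinsic reward ignored
        r_int = agent.compute_intrinsic_reward(obs, action, next_obs)
        agent.update(obs, action, r_int, next_obs, done)   # exploration policy + world model
        obs = env.reset() if done else next_obs

def finetune(agent, env, num_steps):
    """Fine-tuning: the downstream task provides extrinsic rewards; the agent
    adapts the pre-trained policy and world model to maximize them."""
    obs = env.reset()
    for _ in range(num_steps):
        action = agent.act(obs)
        next_obs, r_ext, done, _ = env.step(action)        # task-specific extrinsic reward
        agent.update(obs, action, r_ext, next_obs, done)
        obs = env.reset() if done else next_obs
```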
Abstract
Unsupervised reinforcement learning (URL) is a promising paradigm for learning useful behaviors in a task-agnostic environment without the guidance of extrinsic rewards, so as to facilitate fast adaptation to various downstream tasks. Previous works focus on pre-training in a model-free manner and lack a study of transition dynamics modeling, which leaves large room for improving sample efficiency in downstream tasks. To this end, we propose an Efficient Unsupervised Reinforcement Learning Framework with a Multi-choice Dynamics Model (EUCLID), which introduces a novel model-fused paradigm that jointly pre-trains the dynamics model and the unsupervised exploration policy during pre-training, thus better leveraging environmental samples and improving sample efficiency in downstream tasks. However, constructing a generalizable model that captures the local dynamics under different behaviors remains a challenging problem. We introduce a multi-choice dynamics model that covers different local dynamics under different behaviors concurrently: it uses different heads to learn the state transitions under different behaviors during unsupervised pre-training and selects the most appropriate head for prediction in the downstream task. Experimental results in the manipulation and locomotion domains demonstrate that EUCLID achieves state-of-the-art performance with high sample efficiency, essentially solving the state-based URLB benchmark and reaching a mean normalized score of 104.0±1.2% on downstream tasks with 100k fine-tuning steps, matching DDPG's performance at 2M interaction steps with 20× more data.
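For reference, the normalized score quoted above can be read as follows, assuming the standard URLB convention of normalizing each task's fine-tuned return by the return of a DDPG agent trained from scratch for 2M steps on that task (this normalization rule is our assumption, stated only to make the 104.0% figure concrete):

```latex
% Assumed URLB-style normalization, averaged over the downstream task set \mathcal{T}:
\[
  \text{normalized score}
  = \frac{1}{|\mathcal{T}|}\sum_{\tau\in\mathcal{T}}
    \frac{R^{\tau}_{\text{EUCLID, 100k FT steps}}}{R^{\tau}_{\text{DDPG, 2M steps}}}
  \approx 104.0\% \pm 1.2\%
\]
```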
Mismatch issue: pre-training via diverse exploration alone does not guarantee improved downstream learning.
URL in a model-free manner: longer PT steps → oscillation in performance
EUCLID: longer PT steps → more accurate dynamics model → monotonic improvement
Efficient Unsupervised Reinforcement Learning Framework with Multi-choice Dynamics Model (EUCLID)
EUCLID adopts a task-oriented latent dynamics model as the backbone for environment modeling and consists of two key parts:
a model-fused URL paradigm that integrates the world model into both pre-training and fine-tuning to facilitate downstream task learning;
a multi-headed dynamics model that captures different local environment dynamics separately for accurate prediction over the entire environment (a minimal sketch follows this list).
In this way, EUCLID achieves fast adaptation to downstream tasks by leveraging the accurately pre-trained environment model for effective model-based planning during downstream fine-tuning.
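As a concrete illustration of the multi-headed dynamics model, here is a minimal PyTorch-style sketch; the latent/action dimensions, the winner-take-all loss, and all names are illustrative assumptions, not EUCLID's released implementation.

```python
import torch
import torch.nn as nn

class MultiChoiceDynamics(nn.Module):
    """Multi-headed latent dynamics model: a shared trunk plus K prediction
    heads, each free to specialize in the local dynamics of one behavior."""

    def __init__(self, latent_dim=64, action_dim=6, num_heads=4, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, latent_dim) for _ in range(num_heads)]
        )

    def forward(self, z, a):
        h = self.trunk(torch.cat([z, a], dim=-1))
        # (K, batch, latent_dim): one next-latent-state prediction per head
        return torch.stack([head(h) for head in self.heads])

def multi_choice_loss(model, z, a, z_next):
    """Winner-take-all style update (an assumption for illustration): each
    transition only trains the head that currently predicts it best, so
    heads specialize in different local dynamics."""
    preds = model(z, a)                                       # (K, B, D)
    errors = ((preds - z_next.unsqueeze(0)) ** 2).mean(dim=-1)  # (K, B)
    return errors.min(dim=0).values.mean()
```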
Model-fused URL paradigm
Multi-choice Learning Mechanism
Experiment
Combination
Performance
Performance curves: Walker-Run, Quadruped-Run, Jaco-Reach_bottom_left
Visualization videos: (Left) CIC vs. (Right) EUCLID
Unsupervised Reinforcement Learning Benchmark (URLB)
URLB-Extension
Humanoid-Stand, Humanoid-Walk, Humanoid-Run
Specialization
Specialized region of each prediction head (Quadruped)
Pre-trained ensemble policies for each head (a head-selection sketch follows the list below):
Wobbly Stand
Rotate forward
Turn over
Side jump
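A hedged sketch of downstream head selection, reusing the MultiChoiceDynamics sketch above and assuming the most appropriate head is the one with the lowest one-step prediction error on a small batch of downstream transitions (this selection criterion is an assumption for illustration, not a statement of EUCLID's exact rule):

```python
import torch

@torch.no_grad()
def select_head(model, z, a, z_next):
    """Pick the head with the lowest mean prediction error on downstream
    transitions; that head is then used for model-based planning during
    fine-tuning. (Criterion assumed for illustration.)"""
    preds = model(z, a)                                             # (K, B, D)
    errors = ((preds - z_next.unsqueeze(0)) ** 2).mean(dim=(1, 2))  # (K,)
    return int(errors.argmin())
```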