Efficient Online Reinforcement Learning with Offline Data

Philip J. Ball*, Laura Smith*, Ilya Kostrikov*, Sergey Levine

Question

We have increasing access to prior offline data containing experiences from a behavior policy.

How can we incorporate this prior data into online RL simply + effectively? 🤔

Existing Approaches

Prior work focuses on offline pre-training, which adds complexity and extra design choices.

Why can't we just use Off-Policy RL? It should naturally handle off-policy data!

Problem #1: Value Divergence

Naïvely initializing the replay buffer of Off-Policy RL with offline data results in value divergence.

In offline RL, penalizing out-of-distribution (OOD) actions prevents extrapolation error and fixes this divergence. BUT this is a form of anti-exploration, which is harmful when transitioning online!

Enter LayerNorm! LayerNorm bounds critic extrapolation, but does not bias the critic towards overly pessimistic values!
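
As a rough illustration, a LayerNorm critic might look like the sketch below (a minimal PyTorch-style example with illustrative layer sizes and names, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class LayerNormCritic(nn.Module):
    """Q-network with LayerNorm after each hidden layer.

    Normalizing intermediate features bounds how far Q-values can
    extrapolate on out-of-distribution actions, without adding an
    explicit conservative (pessimistic) penalty.
    """

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Standard Q(s, a): concatenate observation and action.
        return self.net(torch.cat([obs, act], dim=-1))
```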

Problem #2: Data Ingestion

But without pre-training, how do we incorporate offline data?


Two key steps (sketched below):

1: Symmetric sampling of offline and online data, 50:50 per batch

2: Increase gradient steps per timestep to 'back up' offline data quickly
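
A minimal sketch of symmetric sampling, assuming both buffers expose a simple `sample(n, rng)` method returning a dict of arrays (an illustrative interface, not a specific library's API):

```python
import numpy as np

def symmetric_sample(offline_buffer, online_buffer, batch_size, rng):
    """Build one training batch that is 50% offline and 50% online data.

    Both buffers are assumed to expose `sample(n, rng)` returning a dict of
    arrays keyed by e.g. 'obs', 'act', 'rew', 'next_obs', 'done'.
    """
    half = batch_size // 2
    offline = offline_buffer.sample(half, rng)
    online = online_buffer.sample(batch_size - half, rng)
    # Concatenate field-by-field so downstream code sees a single batch.
    return {k: np.concatenate([offline[k], online[k]]) for k in offline}
```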


Final Method: Reinforcement Learning with Prior Data

*RLPD*

Putting it all together...

1. Mitigate Value Divergence

LayerNorm layers in the critic prevent exploding values without conservatism


2. Symmetric Sampling

Sample offline and online data 50:50 per batch


3. Sample Efficient RL

Many gradient steps per timestep leverage offline data quickly
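
Putting the pieces together, a rough sketch of the training loop might look like the following. It assumes a Gymnasium-style `env`, placeholder `agent` and buffer objects with the methods shown, and `symmetric_sample` from the earlier sketch; the update-to-data ratio is just an example value:

```python
def train(env, agent, offline_buffer, online_buffer, rng,
          total_env_steps=250_000, utd_ratio=20, batch_size=256):
    """Illustrative online training loop with prior data (no pre-training)."""
    obs, _ = env.reset()
    for _ in range(total_env_steps):
        # Collect one transition online with the current policy.
        act = agent.act(obs)
        next_obs, rew, terminated, truncated, _ = env.step(act)
        online_buffer.add(obs, act, rew, next_obs, terminated)
        obs = env.reset()[0] if (terminated or truncated) else next_obs

        # High update-to-data ratio: many gradient steps per environment
        # step, each on a 50:50 offline/online batch, so offline data is
        # backed up into the value function quickly.
        for _ in range(utd_ratio):
            batch = symmetric_sample(offline_buffer, online_buffer,
                                     batch_size, rng)
            agent.update(batch)   # e.g. an SAC-style actor-critic update
```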

Results

Despite being significantly simpler, we outperform prior SotA by up to 150%! 🤯


RLPD also generalizes to pixel-based settings, which we evaluate using offline data from the V-D4RL benchmark; it greatly improves over purely online methods.

Links and More information

Code: GitHub

Paper: Link

Twitter: Thread

Email: ball@robots.ox.ac.uk