Philip J. Ball*, Laura Smith*, Ilya Kostrikov*, Sergey Levine
We have increasing access to prior offline data containing experiences from a behavior policy.
How can we incorporate this prior data into online RL simply + effectively? 🤔
Prior work focuses on offline pre-training, which introduces additional complexity and design choices.
Why can't we just use Off-Policy RL? It should naturally handle off-policy data!
Naïvely initializing the replay buffer of Off-Policy RL with offline data results in value divergence.
In offline RL, penalizing out-of-distribution (OOD) actions prevents this extrapolation error and fixes the divergence. BUT, this is a form of anti-exploration, which is harmful when transitioning online!
Enter LayerNorm! Adding LayerNorm to the critic bounds extrapolation, without imposing conservatism on the critic's values!
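To illustrate, here is a minimal PyTorch sketch of a Q-critic with LayerNorm after each hidden layer; the hidden width and activations are illustrative assumptions, not necessarily the exact architecture from the paper:

```python
import torch
import torch.nn as nn

class LayerNormCritic(nn.Module):
    """Q(s, a) critic MLP with LayerNorm after each hidden layer.

    Hidden width (256) and ReLU activations are illustrative assumptions.
    """
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.LayerNorm(hidden),   # normalizes features, bounding extrapolated Q-values
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),   # scalar Q-value
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))
```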
But without pre-training, how do we incorporate offline data?
Two key steps:
1: Symmetric sampling of offline and online data 50:50 per batch
2: Increase gradient steps per timestep to ‘back up’ offline data quickly
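A minimal Python sketch of symmetric sampling, assuming each replay buffer exposes a `sample(n)` method returning a dict of NumPy arrays (this buffer interface is a hypothetical assumption):

```python
import numpy as np

def symmetric_sample(offline_buffer, online_buffer, batch_size: int) -> dict:
    """Draw half the batch from offline data and half from online data."""
    half = batch_size // 2
    offline_batch = offline_buffer.sample(half)             # e.g. {'obs': ..., 'act': ..., ...}
    online_batch = online_buffer.sample(batch_size - half)
    # Concatenate field-by-field into a single training batch.
    return {k: np.concatenate([offline_batch[k], online_batch[k]], axis=0)
            for k in offline_batch}
```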
Putting it all together...
1. Mitigate Value Divergence
LayerNorm layers in the critic prevent exploding values without conservatism
2. Symmetric Sampling
Sample Offline and Online Data 50:50 per batch
3. Sample Efficient RL
Many gradient steps per timestep leverage offline data quickly
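A hypothetical sketch of how these three pieces might fit into one training loop. It assumes a Gymnasium-style `env`, an `agent` with `act`/`update` methods, and the `symmetric_sample` helper from the earlier sketch; the UTD ratio of 20 and batch size of 256 are illustrative defaults, not a prescription.

```python
def train_rlpd(env, agent, offline_buffer, online_buffer,
               total_steps=100_000, utd_ratio=20, batch_size=256):
    """Online training: no pre-training, symmetric sampling, high update-to-data ratio."""
    obs, _ = env.reset()
    for _ in range(total_steps):
        # 1) Collect a single environment transition with the current policy.
        act = agent.act(obs)
        next_obs, rew, terminated, truncated, _ = env.step(act)
        online_buffer.add(obs, act, rew, next_obs, terminated)
        obs = env.reset()[0] if (terminated or truncated) else next_obs

        # 2) Take many gradient steps per environment step (high UTD ratio),
        #    each on a 50:50 offline/online batch (symmetric sampling).
        for _ in range(utd_ratio):
            batch = symmetric_sample(offline_buffer, online_buffer, batch_size)
            agent.update(batch)
```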
Despite being significantly simpler, our method (RLPD) outperforms the prior SotA by up to 150%! 🤯
RLPD also generalizes to pixel-based settings, which we evaluate using the offline V-D4RL datasets, where we greatly improve over purely online methods.