Efficient Online Reinforcement Learning with Offline Data
Philip J. Ball*, Laura Smith*, Ilya Kostrikov*, Sergey Levine
Question
We have increasing access to offline data: prior experience collected by a behavior policy.
How can we incorporate this prior data into online RL simply + effectively? 🤔
Existing Approaches
Prior work focuses on offline pre-training, which adds extra complexity and design choices.
Why can't we just use Off-Policy RL? It should naturally handle off-policy data!
Problem #1: Value Divergence
Naïvely initializing the replay buffer of Off-Policy RL with offline data results in value divergence.
In offline RL, penalizing out-of-distribution (OOD) actions prevents this extrapolation error, fixing the divergence. BUT this conservatism is a form of anti-exploration, which is harmful when transitioning online!
Enter LayerNorm! Adding LayerNorm to the critic bounds extrapolation without biasing its values toward pessimism!
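To make this concrete, here is a minimal sketch of a critic MLP with LayerNorm after each hidden layer, written in PyTorch for illustration (the class name and layer sizes are ours, not necessarily the paper's exact architecture):

```python
import torch
import torch.nn as nn

class LayerNormCritic(nn.Module):
    """Q-network with LayerNorm after each hidden layer to bound extrapolation."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Normalized hidden activations bound how far Q(s, a) can
        # extrapolate on out-of-distribution actions.
        return self.net(torch.cat([obs, action], dim=-1))
```

Because LayerNorm keeps the hidden activations bounded, the final linear layer cannot produce arbitrarily large Q-values on OOD actions, so no explicit pessimism penalty is needed.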
Problem #2: Data Ingestion
But without pre-training, how do we incorporate offline data?
Two key steps:
1: Symmetric sampling: draw offline and online data 50:50 in each batch (sketch after this list)
2: Increase gradient steps per environment step to ‘back up’ offline data quickly
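A minimal sketch of step 1, assuming replay buffers that expose a `sample(n)` method returning dicts of arrays (these interface names are hypothetical, not the paper's API):

```python
import numpy as np

def sample_symmetric(offline_buffer, online_buffer, batch_size: int = 256):
    """Build a batch that is 50% offline transitions, 50% online transitions."""
    half = batch_size // 2
    offline_batch = offline_buffer.sample(half)  # prior offline data
    online_batch = online_buffer.sample(half)    # fresh online data
    # Concatenate field by field (obs, action, reward, next_obs, done).
    return {key: np.concatenate([offline_batch[key], online_batch[key]])
            for key in offline_batch}
```

This keeps freshly collected data well-represented in every gradient step, even when the offline dataset is far larger than the online buffer.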
Final Method: Reinforcement Learning with Prior Data
*RLPD*
Putting it all together...
1. Mitigate Value Divergence
LayerNorm layers in the critic prevent exploding values without introducing conservatism
2. Symmetric Sampling
Sample offline and online data 50:50 per batch
3. Sample Efficient RL
Many gradient steps per environment step leverage the offline data quickly (see the loop sketch below)
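Putting all three into one loop, here is a sketch under hypothetical `agent`, `env`, and buffer interfaces, reusing `sample_symmetric` from above (the update-to-data ratio shown is illustrative):

```python
UTD_RATIO = 20  # gradient updates per environment step (illustrative value)

def env_step_and_update(env, agent, offline_buffer, online_buffer, obs):
    """Collect one transition, then take many gradient steps (high UTD ratio)."""
    action = agent.act(obs)
    next_obs, reward, done, info = env.step(action)
    online_buffer.add(obs, action, reward, next_obs, done)

    # Many backups per environment step propagate value information
    # from the offline data into the (LayerNorm) critic quickly.
    for _ in range(UTD_RATIO):
        batch = sample_symmetric(offline_buffer, online_buffer)
        agent.update(batch)  # e.g., one SAC-style critic/actor update

    return env.reset() if done else next_obs
```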
Results
Despite being significantly simpler, RLPD outperforms the prior SotA by up to 150%! 🤯
RLPD also generalizes to pixel-based settings: evaluating with offline data from the V-D4RL benchmark, we greatly improve on purely online methods.