Philip J. Ball*, Laura Smith*, Ilya Kostrikov*, Sergey Levine
We have increasing access to prior offline data containing experiences from a behavior policy.
How can we incorporate this prior data into online RL simply + effectively? 🤔
Prior work focuses on offline pre-training, which introduces additional complexity and design choices.
Why can't we just use Off-Policy RL? It should naturally handle off-policy data!
Naïvely initializing the replay buffer of Off-Policy RL with offline data results in value divergence.
In offline RL, penalizing out-of-distribution (OOD) actions prevents this extrapolation error and fixes the divergence. BUT, this is a form of anti-exploration, which is harmful when transitioning online!
Enter LayerNorm! Adding LayerNorm to the critic bounds extrapolation, without imposing conservatism on the critic's values!
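To illustrate, here is a minimal PyTorch sketch of a Q-critic with LayerNorm after each hidden layer; the hidden width and activations are illustrative assumptions, not necessarily the exact architecture from the paper:

```python
import torch
import torch.nn as nn

class LayerNormCritic(nn.Module):
    """Q(s, a) critic MLP with LayerNorm after each hidden layer.

    Hidden width (256) and ReLU activations are illustrative assumptions.
    """
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.LayerNorm(hidden),   # normalizes features, bounding extrapolated Q-values
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),   # scalar Q-value
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))
```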
But without pre-training, how do we incorporate offline data?
Two key steps:
1: Symmetric sampling of offline and online data 50:50 per batch
2: Increase gradient steps per timestep to ‘back up’ offline data quickly
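A minimal Python sketch of symmetric sampling, assuming each replay buffer exposes a `sample(n)` method returning a dict of NumPy arrays (this buffer interface is a hypothetical assumption):

```python
import numpy as np

def symmetric_sample(offline_buffer, online_buffer, batch_size: int) -> dict:
    """Draw half the batch from offline data and half from online data."""
    half = batch_size // 2
    offline_batch = offline_buffer.sample(half)             # e.g. {'obs': ..., 'act': ..., ...}
    online_batch = online_buffer.sample(batch_size - half)
    # Concatenate field-by-field into a single training batch.
    return {k: np.concatenate([offline_batch[k], online_batch[k]], axis=0)
            for k in offline_batch}
```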
Putting it all together...
1. Mitigate Value Divergence
LayerNorm layers in the critic prevent exploding values without conservatism
2. Symmetric Sampling
Sample Offline and Online Data 50:50 per batch
3. Sample Efficient RL
Many gradient steps per timestep leverage offline data quickly
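A hypothetical sketch of how these three pieces might fit into one training loop. It assumes a Gymnasium-style `env`, an `agent` with `act`/`update` methods, and the `symmetric_sample` helper from the earlier sketch; the UTD ratio of 20 and batch size of 256 are illustrative defaults, not a prescription.

```python
def train_rlpd(env, agent, offline_buffer, online_buffer,
               total_steps=100_000, utd_ratio=20, batch_size=256):
    """Online training: no pre-training, symmetric sampling, high update-to-data ratio."""
    obs, _ = env.reset()
    for _ in range(total_steps):
        # 1) Collect a single environment transition with the current policy.
        act = agent.act(obs)
        next_obs, rew, terminated, truncated, _ = env.step(act)
        online_buffer.add(obs, act, rew, next_obs, terminated)
        obs = env.reset()[0] if (terminated or truncated) else next_obs

        # 2) Take many gradient steps per environment step (high UTD ratio),
        #    each on a 50:50 offline/online batch (symmetric sampling).
        for _ in range(utd_ratio):
            batch = symmetric_sample(offline_buffer, online_buffer, batch_size)
            agent.update(batch)
```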
Despite being significantly simpler, our method (RLPD) outperforms the prior SotA by up to 150%! 🤯
RLPD also generalizes to pixel-based settings, which we evaluate using the offline V-D4RL datasets, where we greatly improve over purely online methods.