Offline reinforcement learning (RL) aims to learn a policy from a fixed dataset without additional environment interaction. However, effective offline policy learning often requires a large and diverse dataset to mitigate epistemic uncertainty. Collecting such data demands substantial online interaction, which is costly or infeasible in many real-world domains. Therefore, improving policy learning from limited offline data, i.e., achieving high data efficiency, is critical for practical offline RL. In this paper, we propose a simple yet effective plug-and-play pretraining framework that initializes the feature representation of a Q-network to enhance data efficiency in offline RL. Our approach employs a shared Q-network architecture trained in two stages: (i) pretraining a backbone feature extractor with a transition prediction head, and (ii) training the Q-network, which combines the pretrained backbone with a Q-value head, using any offline RL objective. Extensive experiments on the D4RL, Robomimic, V-D4RL, and ExoRL benchmarks show that our method substantially improves both performance and data efficiency across diverse datasets and domains. Remarkably, with only 10% of the dataset, our approach outperforms standard offline RL baselines trained on the full data.
We explicitly distinguish between sample efficiency and data efficiency (redefining the latter) by highlighting the major challenges and corresponding solutions below. While the online RL community actively explores sample efficiency, a promising solution to the data efficiency problem in offline RL remains to be discovered.
In this paper, we propose a simple yet effective pretraining framework that transfers learned transition features into the initialization of the Q-network to improve data efficiency in offline RL. To this end, we design a shared Q-network architecture that combines a shared backbone feature extractor with two shallow head networks: a transition head for next-state prediction and a Q-value head for Q-value estimation. We further introduce a two-stage learning strategy, a pretraining stage followed by an RL training stage, built upon the shared Q-network for data-efficient offline RL.
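To make the architecture concrete, below is a minimal PyTorch sketch of a shared Q-network with a backbone feature extractor and two shallow heads. The layer widths, activation choices, and the decision to feed the concatenated state-action pair into the backbone are illustrative assumptions, not the exact configuration of the method.

```python
import torch
import torch.nn as nn

class SharedQNetwork(nn.Module):
    """Shared backbone with a transition head (pretraining) and a Q-value head (RL training).

    Layer widths and input handling are illustrative assumptions.
    """

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Shared backbone feature extractor over (state, action) pairs.
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Shallow transition head: predicts the next state during pretraining.
        self.transition_head = nn.Linear(hidden_dim, state_dim)
        # Shallow Q-value head: estimates Q(s, a) during offline RL training.
        self.q_head = nn.Linear(hidden_dim, 1)

    def features(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.backbone(torch.cat([state, action], dim=-1))

    def predict_next_state(self, state, action):
        return self.transition_head(self.features(state, action))

    def q_value(self, state, action):
        return self.q_head(self.features(state, action))
```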
Our method follows a two-phase training scheme during offline learning: pretraining and RL training. In the pretraining phase, the shared backbone, with a shallow transition head attached, is trained on the transition dynamics prediction task. Subsequently, the pretrained backbone is connected to a randomly initialized Q-value head and trained with the offline RL value-learning objective.
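The two phases could then be wired together as follows. This is a condensed sketch building on the SharedQNetwork above; it assumes a mean-squared transition-prediction loss for pretraining and uses a generic TD-style update as a stand-in for the offline RL objective (AWAC, CQL, IQL, or TD3+BC would supply the actual loss).

```python
import torch
import torch.nn.functional as F

def pretrain(net: SharedQNetwork, dataloader, epochs: int = 100, lr: float = 3e-4):
    """Phase 1: train the backbone and transition head to predict the next state."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for state, action, next_state in dataloader:  # assumed batch format
            loss = F.mse_loss(net.predict_next_state(state, action), next_state)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # The Q-value head receives no gradient from this loss, so it keeps its
    # random initialization when RL training begins.

def rl_train_step(net, target_net, batch, opt, gamma: float = 0.99):
    """Phase 2: reuse the pretrained backbone and train with a value-learning loss.

    The TD target below is only a placeholder; the actual objective comes from
    whichever offline RL algorithm the framework is plugged into.
    """
    state, action, reward, next_state, next_action, done = batch
    with torch.no_grad():  # next_action is assumed to come from the learned policy
        target_q = reward + gamma * (1.0 - done) * target_net.q_value(next_state, next_action)
    loss = F.mse_loss(net.q_value(state, action), target_q)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```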
We demonstrate strong performance margins by integrating our method with state-of-the-art offline RL baselines (AWAC, CQL, IQL, and TD3+BC) on the D4RL benchmark. We consider three MuJoCo locomotion tasks, each with five datasets of varying sub-optimality. The blue scores highlight the gains over the corresponding baseline algorithms. Notably, AWAC combined with our method shows an average performance improvement of +140.37%.
Tasks: HalfCheetah, Hopper, Walker2d.
Our method improves popular offline RL baselines on the Robomimic manipulation benchmark, demonstrating its adaptability to complex tasks. We remark that our method can be readily plugged into offline RL pipelines whose data is generated either by the agent itself or by human experts of varying behavioral quality.
Tasks: Lift, Can.
We further evaluate our method in a challenging domain with an extremely large action space: dexterous manipulation with 24 degrees of freedom. Our method achieves notable improvements across offline RL algorithms and dataset optimality levels.
Tasks: Pen, Hammer, Door, Relocate.
Our method can be seamlessly integrated into vision-based offline RL by feeding the latent representation from the image encoder, rather than the raw state, into the shared backbone network, a common design in visual RL.
Tasks: Walker Walk, Cheetah Run, Humanoid Walk.
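A minimal sketch of this visual variant is given below, assuming a small convolutional encoder whose latent vector replaces the raw state at the backbone input; the encoder layout, latent dimension, and the choice of predicting the next latent during pretraining are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class VisualSharedQNetwork(nn.Module):
    """Shared Q-network fed by an image encoder instead of raw states.

    The convolutional encoder below is an illustrative stand-in for whatever
    encoder the base visual RL algorithm already uses.
    """

    def __init__(self, action_dim: int, latent_dim: int = 50, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )
        self.backbone = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # During pretraining the transition head predicts the next latent
        # rather than the next raw image (an assumption for this sketch).
        self.transition_head = nn.Linear(hidden_dim, latent_dim)
        self.q_head = nn.Linear(hidden_dim, 1)

    def q_value(self, image: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        latent = self.encoder(image)
        feat = self.backbone(torch.cat([latent, action], dim=-1))
        return self.q_head(feat)
```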
As we define data efficiency in the context of offline RL, a truly data-efficient offline RL method should perform well across varying suboptimality of the data. Hence, we validate our method along two orthogonal axes of data efficiency in D4RL: data quality and data quantity. For quantity, we consider progressively reduced dataset sizes from a 1% subset up to the 100% full dataset, whereas behavioral suboptimality serves as the measure of data quality. Overall, our method trained on only a 10% subset of the data outperforms the vanilla method trained on the full data on average.
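For reference, one plausible way to construct such subsets is uniform random subsampling at the trajectory level, sketched below; the subsampling protocol and the helper name are assumptions made for illustration, not necessarily what the experiments use.

```python
import numpy as np

def subsample_trajectories(trajectories, ratio: float, seed: int = 0):
    """Keep a random fraction of whole trajectories from an offline dataset.

    Trajectory-level subsampling (rather than transition-level) is an
    assumption made for this sketch; `trajectories` is a list of per-episode
    transition arrays.
    """
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(len(trajectories) * ratio))
    idx = rng.choice(len(trajectories), size=n_keep, replace=False)
    return [trajectories[i] for i in idx]

# Example: build the 1%, 10%, and 100% subsets for the data-quantity ablation,
# where `all_trajectories` is a placeholder for the loaded episode list.
# subsets = {r: subsample_trajectories(all_trajectories, r) for r in (0.01, 0.1, 1.0)}
```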
We assume that a smaller dataset induces a more severe distribution shift than a larger one, as the smaller dataset often exhibits narrower coverage of the empirical state space. To examine the effect of data distribution in depth, we evaluate our method across datasets generated by different data collection (exploration) strategies, where each strategy produces a distinct data distribution. Specifically, we consider three data collection strategies (SMM, RND, and ICM) and three ratios (1%, 10%, and 100%) on the ExoRL benchmark. Notably, our method consistently improves the performance of the underlying offline RL algorithm across collection strategies and ratios.
To further support this claim, we consider a goal-reaching task in a maze environment. We train a CQL agent, with and without our method, on two datasets gathered by DIAYN and Proto. The maze environment provides a geometric state space in which the agent's trajectories can be visualized directly. Even with limited state-space coverage, our method still improves the baseline CQL agent by remarkable margins.
Visualized state visitation distributions across exploration strategies. The red marker denotes the starting state and the yellow marker denotes the goal state in the maze.