Anonymous
Image-based reinforcement learning (RL) struggles to generalize when the visual environment changes substantially between the training and testing phases. Under such shifts, learned policies often perform poorly, leading to degraded results. Previous approaches to this problem have largely aimed to broaden the training observation distribution, employing techniques such as data augmentation and domain randomization. Nevertheless, because the decision-making problem is sequential, residual errors made by the policy accumulate along the trajectory, resulting in severely degraded performance.
In this paper, we leverage the observation that a learned reward prediction function often remains reliable under domain shift. We exploit this property to fine-tune the policy with predicted rewards in the target domain. We find that, even under significant domain shift, the predicted reward still provides a meaningful learning signal, and fine-tuning on it substantially improves the policy. Our approach, termed Predicted Reward Fine-tuning (PRFT), improves performance across diverse tasks in both simulated benchmarks and real-world experiments.
Left: Example observations from the source and target environments.
Right: Illustration of the effect of domain shift on reward prediction. The effect is measured by evaluating predicted rewards in both the source and the target environment under the same underlying states. Fitted linear regressions of the predicted rewards against the ground-truth rewards are plotted for both environments.
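To make this check concrete, the sketch below fits a linear regression of predicted against ground-truth rewards for one environment. It is a minimal illustration rather than the authors' evaluation code; the rollout helpers referenced in the commented usage are hypothetical.

```python
import numpy as np

def fit_predicted_vs_groundtruth(predicted, ground_truth):
    """Summarize how well predicted rewards track ground-truth rewards.

    Returns the slope and intercept of a 1-D least-squares fit plus the
    Pearson correlation; a slope near 1 and a high correlation indicate
    the reward predictor remains informative in that environment.
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    slope, intercept = np.polyfit(ground_truth, predicted, deg=1)
    corr = np.corrcoef(ground_truth, predicted)[0, 1]
    return slope, intercept, corr

# Hypothetical usage: roll out the same underlying states in the source and
# target environments, collect (predicted, ground-truth) reward pairs, and
# compare the two fits.
# slope_src, _, corr_src = fit_predicted_vs_groundtruth(pred_src, gt_src)
# slope_tgt, _, corr_tgt = fit_predicted_vs_groundtruth(pred_tgt, gt_tgt)
```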
We propose to jointly learn a reward prediction model from experience alongside the policy. The policy is then fine-tuned with the predicted rewards in the target testing environment. Through extensive experiments in both simulation and the real world, we show that the reward prediction model generalizes well across visual shifts and that fine-tuning on its predictions significantly enhances policy performance.
Left: During training, we optimize the reward prediction module alongside the reinforcement learning objective, using transition tuples sampled from the replay buffer.
Right: During deployment fine-tuning, we use transition tuples labeled with predicted rewards to fine-tune the reinforcement learning policy. The reward prediction module is frozen at this stage.
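The sketch below illustrates the two phases under assumed interfaces: a supervised reward-prediction update during source-domain training, and a policy update driven by the frozen reward model at deployment. The network sizes, the REINFORCE-style fine-tuning objective, and the batch layout are placeholders, not the authors' implementation; PRFT would reuse whatever RL algorithm was used for training and simply substitute the predicted reward for the unavailable environment reward.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 4  # placeholder dimensions

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
reward_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
reward_opt = torch.optim.Adam(reward_model.parameters(), lr=3e-4)

def train_step(batch):
    """Source-domain training: regular RL update (omitted) plus supervised reward prediction."""
    obs, act, rew = batch["obs"], batch["act"], batch["rew"]
    # ... the usual RL update on (obs, act, rew, next_obs) goes here ...
    pred = reward_model(torch.cat([obs, act], dim=-1)).squeeze(-1)
    reward_loss = nn.functional.mse_loss(pred, rew)  # fit the predictor to observed rewards
    reward_opt.zero_grad()
    reward_loss.backward()
    reward_opt.step()

def finetune_step(batch):
    """Target-domain fine-tuning: reward model frozen, policy updated on predicted rewards."""
    obs, act = batch["obs"], batch["act"]
    with torch.no_grad():  # the reward prediction module stays frozen at deployment
        pred_rew = reward_model(torch.cat([obs, act], dim=-1)).squeeze(-1)
    # Placeholder objective: a REINFORCE-style update that weights the log-likelihood
    # of the taken actions by the (centered) predicted reward.
    dist = torch.distributions.Normal(policy(obs), 0.1)
    log_prob = dist.log_prob(act).sum(-1)
    policy_loss = -(log_prob * (pred_rew - pred_rew.mean())).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

# Smoke test with random data (shapes only; real batches come from the replay buffer).
batch = {"obs": torch.randn(8, obs_dim), "act": torch.randn(8, act_dim), "rew": torch.randn(8)}
train_step(batch)
finetune_step({"obs": batch["obs"], "act": batch["act"]})
```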
PRFT improves performance during the fine-tuning phase. Below we show evaluations on environments from the Distracting Control Suite with varying degrees of distraction intensity. Error bars show one standard deviation.
Zero-shot:
W/ PRFT: