Suzan Ece Ada, Georg Martius, Emre Ugur, Erhan Oztop
Boğaziçi University, University of Tübingen, Özyeğin University, Osaka University
We introduce Forecasting in Non-stationary Offline RL (FORL), a novel framework designed to be robust to passive
non-stationarities, leveraging diffusion probabilistic models and time-series forecasting foundation models.
Setting
The agent does not know its location in the environment because its perception is shifted in every episode j by an unknown offset*. FORL leverages historical offset data and offline RL data (collected during a stationary phase) to forecast and correct for new offsets at test time. Ground-truth offsets remain hidden throughout the evaluation episodes.
*Only vertical offsets are illustrated; offsets affect multiple dimensions in our experiments.
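To make the setting concrete, the short sketch below shows one way to write this observation model, assuming the per-episode offset enters additively; the names and the synthetic offset series are ours, not the paper's.

import numpy as np

# Minimal sketch of the evaluation setting (assumption: the offset is additive).
def observe(true_state: np.ndarray, episode_offset: np.ndarray) -> np.ndarray:
    """The agent only ever sees the shifted observation, never `true_state`."""
    return true_state + episode_offset

# Stand-in for a real-world offset time series: one multi-dimensional offset per episode j.
rng = np.random.default_rng(0)
offset_series = np.cumsum(rng.normal(size=(100, 2)), axis=0)
obs = observe(np.array([3.2, 1.1]), offset_series[0])  # what the agent perceives in episode 0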
Existing offline RL methods often assume stationarity or consider only synthetic perturbations at test time, assumptions that frequently fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrading performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific form of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the outset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent’s experience, we aim to bridge the gap between offline RL and the complexity of real-world, non-stationary environments.
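The sketch below strings these two components together into a single test-time step, as we read the pipeline from the description above. Every callable is a placeholder standing in for the real diffusion model, forecaster, selection rule, and offline-RL policy, and subtracting the forecasted offsets from the observation is our assumption about how candidate states and forecasts are put on the same footing.

import numpy as np

def forl_step(obs, offset_history, diffusion_model, forecaster, policy, select):
    """One test-time step of the pipeline as we read it: (i) the conditional
    diffusion model proposes candidate true states from the offset observation,
    (ii) the zero-shot time-series foundation model forecasts the current offset
    from past offsets, and (iii) a selection rule (e.g. DCM) reconciles the two
    before the frozen offline-RL policy acts. All callables are placeholders."""
    candidate_states = diffusion_model(obs)           # (k, state_dim) candidate true states
    forecasted_offsets = forecaster(offset_history)   # (l, state_dim) offset samples
    forecaster_states = obs - forecasted_offsets      # assumption: offsets are additive
    corrected_state = select(candidate_states, forecaster_states)
    return policy(corrected_state)

# Toy stand-ins, only to show the call pattern:
rng = np.random.default_rng(1)
action = forl_step(
    obs=np.array([3.2, 1.1]),
    offset_history=np.zeros((10, 2)),
    diffusion_model=lambda o: o + rng.normal(size=(16, 2)),
    forecaster=lambda h: h[-5:],
    policy=lambda s: np.tanh(s),
    select=lambda c, f: c[np.argmin(np.linalg.norm(c - f.mean(axis=0), axis=1))],
)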
Illustrations of the real state, observations, candidate states generated by our FORL-DM, and states predicted by DQL-Lag-s [1,4] and FORL as the agent navigates in the maze2d-large environment [5].
How gracefully does performance degrade as the offset magnitude α is scaled from 0 (no offset) → 1?
Average normalized scores of FORL and baselines using Diffusion Q-Learning (DQL) [1] across offset scaling factors α ∈ {0, 0.25, 0.5, 0.75, 1} in the navigation environments [5].
Scaling factor α = 0 corresponds to the standard offline RL environment; α = 1.0 is our full evaluation setup.
FORL is policy-agnostic.
Average normalized scores of FORL and baselines using Flow Q-Learning (FQL) [6]
in OGBench [7] antmaze-large-navigate and cube-single-play.
What if we do not have access to past offsets?
Without access to historical offset information before evaluation, FORL-DM (FORL's diffusion model component) achieves a 151.4% improvement over DQL [1], demonstrating its efficacy as a standalone module trained solely on a standard, stationary offline RL dataset without offset labels.
FORL-DM: Directly uses the candidate states generated by FORL's diffusion model component.
H-LAG: We maintain a history of the offsets predicted by FORL-DM over the most recent episodes (excluding the evaluation interval, since offsets are not revealed after episode termination at test time). We then feed this history into the zero-shot foundation model to generate offset samples for the next evaluation episodes and apply these samples directly at test time.
H-LAG+DCM: We first follow the same procedure as in H-LAG to obtain predictions from the zero-shot foundation model, then apply Dimension-wise Closest Match (DCM) to these predicted offsets and the candidate states generated by FORL-DM (see the code sketch after this list).
MED+DCM: Calculates the median of the offsets from the previous episode (starting from the offsets predicted by the diffusion model), fits a Gaussian distribution centered at this median, draws from it the same number of samples l as the zero-shot FM, and applies DCM.
MED+NOISE: Computes the median offset from the diffusion model during the initial evaluation episode; in subsequent episodes, adds random noise to the median of the offsets from the previous episode.
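For concreteness, here is a minimal sketch of the two pieces shared by H-LAG+DCM and MED+DCM. The exact definition of Dimension-wise Closest Match is not spelled out above, so the per-dimension rule below (keep the candidate value closest to any forecasted value in that dimension) is one plausible reading rather than the paper's definition; the Gaussian scale in the MED+DCM-style sampler is likewise our own placeholder.

import numpy as np

def dcm(candidate_states: np.ndarray, forecasted_states: np.ndarray) -> np.ndarray:
    """Dimension-wise Closest Match, under one plausible reading: for each state
    dimension, keep the diffusion-model candidate value that lies closest to any
    of the forecasted values in that dimension."""
    dim = candidate_states.shape[1]
    matched = np.empty(dim)
    for d in range(dim):
        cand, fore = candidate_states[:, d], forecasted_states[:, d]
        dists = np.abs(cand[:, None] - fore[None, :])   # (k, l) pairwise distances
        matched[d] = cand[np.unravel_index(dists.argmin(), dists.shape)[0]]
    return matched

def med_gaussian_samples(prev_offsets: np.ndarray, l: int, scale: float = 1.0) -> np.ndarray:
    """MED+DCM-style sampler (sketch): a Gaussian centred on the median of the
    previous episode's offsets, drawn with the same count l as the zero-shot FM;
    `scale` is a placeholder, not a value from the paper."""
    rng = np.random.default_rng(0)
    median = np.median(prev_offsets, axis=0)
    return rng.normal(loc=median, scale=scale, size=(l, median.shape[0]))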
Dimension-wise Closest Match (DCM) achieves the highest performance among these standard approaches.
No hyperparameters & No fallback mechanism
KDE: For each dimension, we fit a kernel density estimator (KDE) on the states generated by FORL-DM and evaluate the resulting probability density function at each forecasted state. We then obtain a single representative sample by taking the density-weighted average of the forecasted states.
DM-FS-mean(s) and DM-FS-med(s) select the DM prediction closest to the mean and median of the forecaster's predictions, respectively.
MAX constructs a diagonal multivariate distribution from the dimension-wise mean and standard deviation of the forecasted states, then selects the sample predicted by our diffusion model with the highest likelihood under that distribution. (All three rules are sketched in code below.)
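The three selection baselines above can be summarised with the following sketch; the function names and the use of numpy/scipy are ours, and combining the per-dimension KDE densities by a product is our assumption, since the text does not specify how they are aggregated into a single weight.

import numpy as np
from scipy.stats import gaussian_kde, norm

def kde_select(dm_states, forecasted_states):
    """KDE: per-dimension KDEs fit on the diffusion-model states score each
    forecasted state; the representative sample is the score-weighted average
    of the forecasted states."""
    weights = np.ones(len(forecasted_states))
    for d in range(dm_states.shape[1]):
        weights *= gaussian_kde(dm_states[:, d])(forecasted_states[:, d])
    return (weights / weights.sum()) @ forecasted_states

def dm_fs_select(dm_states, forecasted_states, stat=np.mean):
    """DM-FS-mean(s) / DM-FS-med(s): the DM candidate closest to the mean
    (or, with stat=np.median, the median) of the forecaster's predictions."""
    target = stat(forecasted_states, axis=0)
    return dm_states[np.argmin(np.linalg.norm(dm_states - target, axis=1))]

def max_select(dm_states, forecasted_states):
    """MAX: a diagonal Gaussian built from the forecasted states' dimension-wise
    mean and standard deviation; pick the DM candidate with the highest likelihood."""
    mu = forecasted_states.mean(axis=0)
    sigma = forecasted_states.std(axis=0) + 1e-8
    log_lik = norm(mu, sigma).logpdf(dm_states).sum(axis=1)
    return dm_states[np.argmax(log_lik)]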
FORL (DCM) yields substantially more stable prediction errors than MAX in both maximum error (2.40 vs. 9.33) and mean error, demonstrating its robustness.
What if the offsets change “in-episode”, every f = 50 timesteps?
Offsets are never revealed (DQL vs. FORL-DM) ➡️ FORL-DM
Offsets are revealed after each episode (DQL-Lag-s vs. FORL) ➡️ FORL
FORL's diffusion model maintains consistent performance across varying sample sizes, demonstrating robustness to the number of candidate states generated.
Citation
@inproceedings{
ada2025forecasting,
title={Forecasting in Offline Reinforcement Learning for Non-stationary Environments},
author={Suzan Ece Ada and Georg Martius and Emre Ugur and Erhan Oztop},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=24UJqxw1kv}
}
References
[1] Wang, Z., Hunt, J. J., and Zhou, M. "Diffusion policies as an expressive policy class for offline reinforcement learning." ICLR 2023.
[2] Fujimoto, S., and Gu, S. S. "A minimalist approach to offline reinforcement learning." Advances in Neural Information Processing Systems 34 (2021): 20132-20145.
[3] Yang, R., Bai, C., Ma, X., Wang, Z., Zhang, C., and Han, L. "RORL: Robust offline reinforcement learning via conservative smoothing." Advances in Neural Information Processing Systems 35 (2022): 23851-23866.
[4] Rasul, K., Ashok, A., Williams, A. R., Khorasani, A., Adamopoulos, G., Bhagwatkar, R., Biloš, M., et al. "Lag-Llama: Towards foundation models for time series forecasting." In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023.
[5] Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. "D4RL: Datasets for deep data-driven reinforcement learning." arXiv preprint arXiv:2004.07219 (2020).
[6] Park, S., Li, Q., and Levine, S. "Flow Q-learning." ICML 2025.
[7] Park, S., Frans, K., Eysenbach, B., and Levine, S. "OGBench: Benchmarking offline goal-conditioned RL." ICLR 2025.