ETH Zurich, Switzerland
Reinforcement Learning (RL) has achieved impressive results in robotics, yet high-performing pipelines remain highly task-specific, with little reuse of prior data. Offline Model-based RL (MBRL) offers greater data efficiency by training policies entirely from existing datasets, but suffers from compounding errors and distribution shift in long-horizon rollouts. Although existing methods have shown success in controlled simulation benchmarks, robustly applying them to the noisy, biased, and partially observed datasets typical of real-world robotics remains challenging. We present a principled pipeline for making offline MBRL effective on physical robots. Our world model, RWM-U, extends autoregressive world models with epistemic uncertainty estimation, enabling temporally consistent multi-step rollouts with uncertainty propagated effectively over long horizons. We combine RWM-U with MOPO-PPO, which adapts uncertainty-penalized policy optimization to the stable, on-policy PPO framework for real-world control. We evaluate our approach on diverse manipulation and locomotion tasks in simulation and on real quadruped and humanoid robots, training policies entirely from offline datasets. The resulting policies consistently outperform model-free and uncertainty-unaware model-based baselines, and incorporating real-world data into model learning further yields robust policies that surpass online model-free baselines trained solely in simulation.
Uncertainty-Aware Robotic World Model (RWM-U) extends Robotic World Model (RWM) by explicitly modeling how reliable its predictions are, in addition to what will happen. It augments the autoregressive world model with ensemble-based uncertainty estimation, allowing the model to quantify epistemic uncertainty that arises from limited or biased offline data.
By propagating this uncertainty consistently over long imagined rollouts, RWM-U can identify regions where predictions become unreliable under distribution shift. This uncertainty signal is then used during policy learning to penalize high-risk imagined transitions, enabling stable long-horizon rollouts and making fully offline model-based reinforcement learning practical on real robots.
Each ensemble member of RWM-U independently predicts a Gaussian distribution over the next observation. The variance within each prediction captures aleatoric uncertainty, while the variance across ensemble means estimates epistemic uncertainty arising from limited or biased offline data.
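To make this concrete, the snippet below sketches how such an ensemble of Gaussian dynamics heads and the resulting uncertainty decomposition might look in PyTorch. The network sizes, module names, and the `ensemble_uncertainty` helper are illustrative assumptions, not the released RWM-U implementation.

```python
# Minimal sketch (assumed architecture): an ensemble of Gaussian dynamics heads
# and the standard decomposition of predictive uncertainty.
import torch
import torch.nn as nn


class GaussianDynamicsHead(nn.Module):
    """One ensemble member: predicts mean and log-variance of the next observation."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.mean = nn.Linear(hidden, obs_dim)
        self.log_var = nn.Linear(hidden, obs_dim)

    def forward(self, obs, act):
        h = self.net(torch.cat([obs, act], dim=-1))
        return self.mean(h), self.log_var(h).clamp(-10.0, 4.0)


def ensemble_uncertainty(members, obs, act):
    """Aleatoric: mean of predicted variances. Epistemic: variance across ensemble means."""
    means, variances = [], []
    for m in members:
        mu, log_var = m(obs, act)
        means.append(mu)
        variances.append(log_var.exp())
    means = torch.stack(means)          # (K, B, obs_dim)
    variances = torch.stack(variances)  # (K, B, obs_dim)
    aleatoric = variances.mean(dim=0)
    epistemic = means.var(dim=0, unbiased=False)
    return means.mean(dim=0), aleatoric, epistemic
```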
During policy training in imagination, MOPO-PPO penalizes high-risk imagined transitions using this uncertainty signal, balancing task performance against model confidence.
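A minimal sketch of this uncertainty-penalized reward, in the spirit of MOPO: each imagined transition's reward is reduced in proportion to a scalar summary of the epistemic uncertainty before it enters the PPO update. The penalty coefficient and the norm used to reduce the per-dimension variance to a scalar are assumptions for illustration.

```python
import torch


def penalized_reward(reward: torch.Tensor,
                     epistemic_var: torch.Tensor,
                     penalty_coef: float = 1.0) -> torch.Tensor:
    """Return r_tilde = r - lambda * u, where u summarizes per-dimension epistemic variance.

    reward:        (B,) imagined rewards
    epistemic_var: (B, obs_dim) variance of ensemble means (see the sketch above)
    """
    u = epistemic_var.sqrt().norm(dim=-1)  # scalar uncertainty proxy per transition
    return reward - penalty_coef * u
```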
RWM-U builds on the same general, domain-agnostic neural simulator as RWM, but augments it with ensemble-based epistemic uncertainty estimation that remains consistent over long autoregressive rollouts.
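The rollout loop below illustrates the autoregressive use of such an ensemble: the mean prediction is fed back as the next input while the per-step epistemic uncertainty is recorded along the trajectory. It simplifies RWM-U to a single-step, history-free predictor, so the horizon, policy interface, and the `predict` callable (for example, the hypothetical `ensemble_uncertainty` helper with the ensemble members bound in) are assumptions.

```python
import torch


@torch.no_grad()
def imagine_rollout(predict, policy, obs, horizon: int = 100):
    """Autoregressive imagination rollout that records epistemic uncertainty per step.

    predict(obs, act) -> (mean_next_obs, aleatoric_var, epistemic_var)
    """
    observations, step_uncertainty = [obs], []
    for _ in range(horizon):
        act = policy(obs)
        next_obs, _, epistemic_var = predict(obs, act)
        observations.append(next_obs)
        step_uncertainty.append(epistemic_var.sqrt().norm(dim=-1))  # one scalar per env
        obs = next_obs  # feed the prediction back in: errors and uncertainty compound here
    return torch.stack(observations), torch.stack(step_uncertainty)
```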
In our evaluation suite, RWM-U is tested on both manipulation and locomotion tasks in fully offline settings, including Reach-Franka manipulation and velocity-tracking locomotion for ANYmal D and Unitree G1, with policies trained entirely from fixed datasets and no online interaction.
Across these environments, RWM-U produces uncertainty estimates that closely track long-horizon prediction error and enables uncertainty-penalized policy optimization to outperform uncertainty-unaware model-based and model-free baselines, demonstrating reliable long-horizon modeling and safe generalization under severe distribution shift.
Imagination vs. Ground Truth rollout comparison.
RWM-U explicitly tracks epistemic uncertainty during imagination and keeps rollouts within regions where the learned dynamics remain trustworthy.
MOPO-PPO leverages this calibrated uncertainty to steadily increase imagination reward while avoiding exploitation of model errors, enabling meaningful exploration without drifting off-distribution.
As a result, the learned policy transfers cleanly to hardware and produces stable locomotion while still discovering behaviors beyond the offline dataset.
When uncertainty is not properly constrained, MOPO-PPO can exploit imperfections in RWM and inflate rewards purely in imagination.
The policy becomes overconfident in hallucinated rollouts, optimizing toward behaviors that look optimal in the model but violate real robot dynamics.
This failure mode appears on hardware as unstable walking, unintended collisions, or complete breakdown of locomotion despite high imagined returns.
If uncertainty penalties are too strong, MOPO-PPO is forced to stay in extremely low-uncertainty regions of RWM-U's learned dynamics.
This prevents the policy from exploring new locomotion strategies in imagination and traps learning near the dataset’s safest behaviors.
Consequently, deployment becomes overly cautious, often standing still or producing only small, ineffective motions that fail to make task progress.
RWM-U-enabled MOPO-PPO achieves hardware performance comparable to strong online model-free RL baselines, despite relying primarily on offline data.
By learning dynamics that support long-horizon rollouts and using uncertainty-aware imagination to train robust policies, RWM-U reduces the need for extensive real-world interaction.
When augmented with a modest amount of real robot data, RWM-U further bridges the sim-to-real gap and surpasses simulator-based online model-free RL on real deployments.
Much of modern offline reinforcement learning is built around off-policy, value-function methods like CQL, IQL, and COMBO. On standard benchmarks these approaches work remarkably well, but robotics, especially legged locomotion, plays by different rules. In practice, the most reliable controllers for real robots today are trained with on-policy methods, and PPO in particular has emerged as the empirical workhorse for high-dimensional control. It scales cleanly with massively parallel simulation, optimizes stably over long horizons, and produces policies that reliably transfer to hardware. The underlying reasons for this gap are still not fully understood, and existing studies consistently show how difficult it is to tune SAC-based methods to achieve PPO-level stability in large-scale locomotion training. Despite continued research, there is still no reliable recipe for deploying SAC-style approaches in this domain.
If offline RL is to work on real robots, a different alignment of tools is required. Policies must be trained on-policy using long-horizon rollouts, and they must be able to leverage synthetic experience because no new real-world data is available. This makes model-based reinforcement learning not just attractive, but essential. A learned world model can expand a limited dataset into millions of imagined trajectories, enabling long-horizon credit assignment that would otherwise be impossible.
However, learned models are imperfect, and in fully offline settings there is no corrective feedback. Errors compound during long autoregressive rollouts, and PPO will optimize against hallucinated dynamics unless it is explicitly informed about what is unreliable. Prior offline model-based approaches typically avoid this issue by restricting rollouts to short horizons, relying on single-step predictions, or operating in simplified simulated domains. These design choices do not scale to real locomotion.
RWM-U is designed around this exact failure mode. By building on an autoregressive world model and explicitly estimating epistemic uncertainty, it tracks where predictions degrade as rollouts extend over tens or hundreds of steps. That uncertainty is propagated into policy optimization, allowing PPO to improve policies while remaining grounded in regions supported by data. The result is a fully offline, long-horizon, uncertainty-aware model-based RL pipeline that remains stable and deployable on real quadruped and humanoid robots.
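To show where the uncertainty signal meets policy optimization, the sketch below runs standard generalized advantage estimation on the penalized imagined rewards, so high-uncertainty transitions directly lower the advantages driving the PPO update. The discount and GAE parameters are conventional defaults and episode terminations are omitted for brevity; this is an illustrative sketch under those assumptions, not the exact training loop.

```python
import torch


def gae_on_penalized_rewards(penalized_rewards: torch.Tensor,
                             values: torch.Tensor,
                             gamma: float = 0.99,
                             lam: float = 0.95):
    """GAE over an imagined rollout of uncertainty-penalized rewards.

    penalized_rewards: (T, B) rewards after subtracting the uncertainty penalty
    values:            (T + 1, B) critic estimates along the imagined trajectory
    """
    horizon = penalized_rewards.shape[0]
    advantages = torch.zeros_like(penalized_rewards)
    gae = torch.zeros_like(penalized_rewards[0])
    for t in reversed(range(horizon)):
        delta = penalized_rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    value_targets = advantages + values[:-1]
    return advantages, value_targets
```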