ETH Zurich, Switzerland
Learning robust and generalizable world models is crucial for enabling efficient and scalable robotic control in real-world environments. In this work, we introduce a novel framework for learning world models that accurately capture complex, partially observable, and stochastic dynamics. The proposed method employs a dual-autoregressive mechanism and self-supervised training to achieve reliable long-horizon predictions without relying on domain-specific inductive biases, ensuring adaptability across diverse robotic tasks. We further propose a policy optimization framework that leverages world models for efficient training in imagined environments and seamless deployment in real-world systems. This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.
Robotic World Model (RWM) is a learned black-box neural network simulator that predicts future observations of a robot from a window of past observation-action history, letting policies train “in imagination” instead of requiring real interactions.
It is trained in a self-supervised, autoregressive manner over multiple stochastic forecast steps, so it learns to remain stable over long rollouts and to mitigate compounding errors, even under partially observable and stochastic dynamics.
Architecturally, RWM uses a GRU with a dual-autoregressive mechanism (inner updates through history, outer feedback of predictions) plus heads for observations and privileged signals, giving a simple, domain-agnostic model that generalizes across robots and tasks.
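The dual-autoregressive mechanism can be sketched in a few lines. Everything below is an illustrative toy, not the authors' implementation: the class name, sizes, single-layer GRU, and Gaussian observation head are assumptions. The inner autoregression carries history through the GRU hidden state; the outer autoregression feeds sampled predictions back as inputs during imagination.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ToyRWM:
    """Toy GRU world model with a Gaussian observation head (illustrative only).

    The real model additionally has a head for privileged signals
    (e.g. contacts); it is omitted here for brevity.
    """

    def __init__(self, obs_dim, act_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = obs_dim + act_dim
        # Stacked GRU gate weights: update (z), reset (r), candidate (n).
        self.W = rng.normal(0, 0.1, (3, hid_dim, in_dim))
        self.U = rng.normal(0, 0.1, (3, hid_dim, hid_dim))
        self.b = np.zeros((3, hid_dim))
        # Observation head predicts mean and log-std of the next observation.
        self.Wo = rng.normal(0, 0.1, (2 * obs_dim, hid_dim))
        self.obs_dim, self.hid_dim = obs_dim, hid_dim

    def step(self, obs, act, h):
        # Inner autoregression: the hidden state h accumulates the
        # observation-action history across time steps.
        x = np.concatenate([obs, act])
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
        n = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
        h = (1 - z) * h + z * n
        out = self.Wo @ h
        return out[: self.obs_dim], out[self.obs_dim :], h  # mean, log_std, h

    def imagine(self, obs0, actions, rng):
        # Outer autoregression: sampled predictions are fed back as inputs.
        obs, h, traj = obs0, np.zeros(self.hid_dim), []
        for act in actions:
            mean, log_std, h = self.step(obs, act, h)
            obs = mean + np.exp(log_std) * rng.normal(size=mean.shape)
            traj.append(obs)
        return np.stack(traj)
```

Sampling from the predicted distribution (rather than always taking the mean) is what makes each imagined rollout stochastic.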
RWM is a general neural network simulator that learns long-horizon dynamics without domain-specific inductive biases, and it stays accurate under autoregressive rollouts across very different robots and tasks.
In our evaluation suite, the same model and training pipeline work for manipulation and locomotion environments including Reach-UR10, Reach-Franka, Lift-Cube-Franka, Open-Drawer-Franka, Repose-Cube-Allegro, and velocity-tracking tasks for Unitree A1/Go1/Go2, ANYmal B/C/D, Spot, Cassie, H1, and G1.
Across all these settings, RWM trained autoregressively achieves the lowest prediction error relative to MLP, RSSM, and transformer baselines, demonstrating strong robustness and transferability.
[Figure: side-by-side comparison of Imagination (autoregressive rollout) and Ground Truth]
RWM performs autoregressive imagination by feeding its own predicted observations back into the model, enabling long rollouts that stay closely aligned with ground-truth trajectories rather than drifting over time.
This stability comes from dual autoregression and multi-step self-supervised training, which explicitly reduce compounding errors even beyond the training forecast horizon.
As a result, imagined trajectories remain robust under noise and support reliable long-horizon policy learning and transfer.
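The multi-step self-supervised objective can be sketched as follows. This is a minimal illustration, not the paper's loss: `predict` stands in for any one-step dynamics model, the toy linear system exists only so the snippet runs standalone, and real training would backpropagate through the entire rollout rather than just evaluate the loss.

```python
import numpy as np

def multi_step_loss(predict, obs_seq, act_seq, horizon):
    """Roll the model on its own predictions for `horizon` steps and
    accumulate per-step squared error against the recorded trajectory.
    Minimizing this (by backprop through the rollout) is what explicitly
    penalizes compounding errors, unlike a one-step teacher-forced loss.
    """
    obs, loss = obs_seq[0], 0.0
    for t in range(horizon):
        pred = predict(obs, act_seq[t])
        loss += np.mean((pred - obs_seq[t + 1]) ** 2)  # vs. ground truth
        obs = pred  # feed the prediction back, not the recorded observation
    return loss / horizon

# Standalone toy data: a linear system and a slightly mismatched model.
A, B = 0.9 * np.eye(3), np.ones((3, 2))
obs_seq = [np.ones(3)]
act_seq = [0.1 * np.ones(2) for _ in range(5)]
for a in act_seq:
    obs_seq.append(A @ obs_seq[-1] + B @ a)
model = lambda o, a: 0.95 * o + B @ a  # dynamics are deliberately wrong
loss = multi_step_loss(model, obs_seq, act_seq, horizon=5)
```

Because the prediction (not the ground truth) is fed back at each step, small model errors grow along the rollout and are penalized accordingly.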
MBPO-PPO trains policies inside RWM’s imagined rollouts, alternating between collecting real data, updating the world model autoregressively, and running long-horizon PPO updates on predicted trajectories.
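The alternation described above can be summarized as a control-flow skeleton. All component bodies below are stubs with hypothetical names; only the loop structure mirrors the description, and the iteration counts are arbitrary.

```python
def mbpo_ppo_sketch(num_iters=3, real_steps=8, imagined_horizon=16):
    """Structural sketch of an MBPO-style loop with PPO: alternate real-data
    collection, autoregressive world-model updates, and PPO updates on
    long-horizon imagined rollouts. Stubbed components, illustrative only.
    """
    replay = []

    def collect_real_data(n):
        # Roll out the current policy on the real system (stub).
        return [("obs", "act", "next_obs") for _ in range(n)]

    def update_world_model(data):
        # Multi-step autoregressive world-model training (stub).
        pass

    def imagine_rollouts(horizon):
        # Long-horizon autoregressive rollouts inside the model (stub).
        return [("obs", "act", "reward") for _ in range(horizon)]

    def ppo_update(rollouts):
        # Standard PPO policy/value update on imagined trajectories (stub).
        pass

    for _ in range(num_iters):
        replay += collect_real_data(real_steps)
        update_world_model(replay)
        ppo_update(imagine_rollouts(imagined_horizon))
    return len(replay)
```

The key point is the ordering: the model is refit on all real data collected so far before each round of PPO updates in imagination.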
Despite PPO’s tendency to exploit model errors, RWM’s rollout fidelity keeps model error decreasing and predicted rewards aligned with ground truth, enabling stable optimization over 100+ autoregressive steps.
The resulting policies transfer zero-shot to ANYmal D and Unitree G1 hardware, reliably tracking velocity commands and staying robust to impacts and disturbances.
Robustness is especially hard for learned dynamics, since even small perturbations can push autoregressive rollouts off-distribution and make the model hallucinate future trajectories.
RWM’s dual-autoregressive training and stable long-horizon prediction mitigate this compounding-error failure mode, enabling policies that stay reliable under real-world disturbances.
Because RWM predicts a stochastic distribution over next observations, each imagination rollout naturally introduces varied dynamics, effectively giving domain randomization for free and improving hardware generalization.
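A toy example of why sampling from a stochastic observation head acts like domain randomization. The linear dynamics and noise scale here are made-up stand-ins for a learned model; the point is only that two rollouts from the same initial state diverge, so the policy trains on varied dynamics.

```python
import numpy as np

def stochastic_rollout(obs0, horizon, rng):
    """Roll a toy stochastic one-step model forward: the next observation is
    sampled from a Gaussian around nominal dynamics, not taken as the mean."""
    obs, traj = obs0, []
    for _ in range(horizon):
        mean = 0.98 * obs + 0.01                        # nominal prediction
        obs = mean + 0.05 * rng.normal(size=obs.shape)  # sample, don't argmax
        traj.append(obs)
    return np.stack(traj)

rng = np.random.default_rng(0)
a = stochastic_rollout(np.zeros(3), 20, rng)
b = stochastic_rollout(np.zeros(3), 20, rng)
# Two imagined rollouts from the identical start state differ, exposing the
# policy to a distribution of dynamics during training.
```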
Robotics today still mostly follows a pretty rigid recipe: build a high-fidelity simulator, train a policy in it until it works, then deploy to hardware and call it done. That’s effective, but it also means learning stops the moment the robot hits the real world—right when conditions start drifting, contacts get weird, and the environment stops matching the sim. If we want robots that feel genuinely intelligent, they should be able to keep adapting after deployment, not just execute a frozen policy.
The catch is that online adaptation is hard. Real-world data is slow and expensive to collect, and naive exploration on hardware can be unsafe or just outright break the robot. So instead of relying on endless real trials, we’d like to learn a world model—a dynamics model trained from the data we do have—that can generate effectively unlimited imagined rollouts. In theory, that gives us the scale of simulation without the cost and risk of living on the robot.
But learning policies inside learned models is exactly where many approaches fall apart. PPO is still the empirically reliable workhorse for robot control, yet PPO needs long-horizon trajectories to estimate returns and fit its value function. If a learned model starts hallucinating during autoregressive rollouts, those errors compound over tens or hundreds of steps, and PPO will happily optimize against a fantasy world. So the real problem isn’t just “learn a model,” it’s “learn a model that stays stable when rolled out long-horizon, the way PPO requires.”
Most prior work tries to tame hallucination by injecting domain knowledge or specialized structure, which can help—but at the cost of narrowing where the method works. RWM takes the opposite bet: keep it simple, end-to-end, and black-box, while training autoregressively in a way that explicitly targets long-horizon stability. The payoff is that once imagination stays grounded, downstream PPO training stays grounded too—so the same pipeline can scale across real-world robotic tasks without needing hand-crafted tricks for each new domain.
Training RWM and MBPO-PPO on hardware would be the cleanest proof that the method really delivers online adaptation in the wild. While this is a key long-term objective, several challenges currently prevent real-world deployment.
During online learning, the policy often exploits minor world model errors, leading to overly optimistic behaviors that result in collisions. In simulation, these failures serve as necessary corrective signals, but in real hardware, they pose a risk to the robot. Our experiments show that such failures occur more than 20 times on average during online learning, which would be detrimental to real-world systems. Even if hardware collisions were acceptable, fully automating online learning would require a recovery policy capable of resetting the robot to an initial state—a particularly challenging requirement for large platforms like ANYmal D or Unitree G1. Additionally, privileged information used to finetune the dynamics model (e.g., contacts) must be either measured or estimated using onboard sensors, which may not always be available.
To mitigate error exploitation, uncertainty-aware world models could be explored, but integrating them into dynamics learning would require additional architectural modifications and further reduce data efficiency. Due to these challenges, we approximate real-world constraints by using only a single simulation environment with domain shifts from pretraining environments. This setup reduces engineering effort while proving the feasibility of our approach. Our ongoing work explicitly addresses these issues.