Learning Rollout from Sampling:

An R1-Style Tokenized Traffic Simulation Model

Abstract

Learning diverse and high-fidelity traffic simulation from human-driver demonstrations is crucial for testing autonomous driving systems. The next-token prediction (NTP) paradigm widely adopted in large language models (LLMs) has recently been applied to traffic simulation, where it achieves iterative improvements via supervised fine-tuning (SFT). However, such methods constrain active exploration of potentially high-value motion tokens, particularly in sub-optimal areas. Entropy patterns offer a promising perspective for enabling exploration driven by motion token uncertainty. Inspired by these observations, we propose R1Sim, a novel tokenized traffic simulation policy that explores reinforcement learning through motion token entropy patterns and analyzes how different motion tokens affect simulation results. Specifically, we introduce an entropy-guided adaptive sampling mechanism that targets previously overlooked motion tokens that have high uncertainty yet are potentially optimal. We further optimize motion behaviors using group relative policy optimization (GRPO), guided by a safety-aware reward design. Together, these components enable a well-balanced exploration–exploitation trade-off through diverse, high-uncertainty sampling and comparative group-wise estimation, resulting in realistic, safe, and diverse multi-agent motion behaviors. Extensive experiments on the Waymo Sim Agent benchmark demonstrate that R1Sim achieves competitive performance against state-of-the-art baselines.

Motivation

Despite promising stability in motion generation, existing tokenized motion models still fail to address two key challenges. 1) How to enable adaptive exploration that samples multiple plausible motion scenarios. State-of-the-art approaches use a Top-K sampling strategy that selects a fixed number of motion tokens for simulation rollout. This rigid strategy over-prioritizes high-probability motion tokens while neglecting potentially valuable "hidden gem" behaviors in the token vocabulary, which is particularly detrimental in interactive scenarios where diverse motion outcomes are essential. 2) Once the exploration space is established, how to enable effective exploitation that optimizes the realism and safety of multi-agent motion behaviors. Existing optimization methods, such as supervised fine-tuning (SFT), often employ winner-takes-all approaches that force generated states to match expert demonstrations. However, over-reliance on potentially suboptimal ground truth may perpetuate unsafe behaviors.

Framework

R1Sim follows an NTP-based autoregressive framework for sequential motion token generation. Given the current scene context, the policy first estimates token-level uncertainty via entropy and performs entropy-guided adaptive sampling to generate diverse candidate rollouts. These rollouts are then evaluated using a token-level, safety-aware reward defined in the traffic simulation environment. Finally, the policy is optimized with GRPO, leveraging group-wise relative advantages and KL regularization to reinforce human-preferred motion behaviors.
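The pipeline described above can be illustrated with a minimal Python sketch. It is not the R1Sim implementation: the function names, the linear rule that maps entropy to a candidate count, the `k_min`/`k_max` bounds, and the use of one scalar reward per rollout (rather than the token-level reward described above) are all assumptions made for illustration. The KL regularization term and the policy update itself are omitted. The advantage computation follows the standard GRPO form, in which each rollout's reward is normalized by the mean and standard deviation of its group.

```python
import math
import random


def token_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution over motion tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_k(probs, k_min=2, k_max=8):
    """Hypothetical rule: grow the candidate count linearly with normalized entropy."""
    max_entropy = math.log(len(probs))
    ratio = token_entropy(probs) / max_entropy if max_entropy > 0.0 else 0.0
    return k_min + round(ratio * (k_max - k_min))


def sample_token(probs, rng, k_min=2, k_max=8):
    """Sample one token index from the top-k tokens, with k chosen by adaptive_k."""
    k = min(adaptive_k(probs, k_min, k_max), len(probs))
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its rollout group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    confident = [0.97, 0.01, 0.01, 0.01]   # low entropy: few candidates kept
    uncertain = [0.25, 0.25, 0.25, 0.25]   # high entropy: more candidates kept
    print(adaptive_k(confident), adaptive_k(uncertain))
    print(sample_token(uncertain, rng))
    # Assumed scalar rewards for a group of four rollouts of the same scene.
    print(group_relative_advantages([0.9, 0.4, 0.7, 0.2]))
```

Under this rule, a confident distribution keeps only a few candidates and behaves much like fixed Top-K sampling, while an uncertain one admits lower-probability tokens into the rollout. Because advantages are computed relative to the group, a rollout is reinforced only when it scores better than its siblings sampled from the same scene context.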