Steering Discrete Diffusion across Training and Inference (02/20/2026)
Presenter: Kevin Rojas
Discrete diffusion models have emerged as a powerful framework for generative modeling of discrete data. They enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to auto-regressive large language models (LLMs). However, controlling their output to satisfy given conditions remains challenging. In this talk we explore two strategies for "steering" these models toward desired properties, one at training time and one at inference time.
First, we address inference-time control by formalizing how to properly extend *Classifier-Free Guidance (CFG)* to discrete domains, enabling robust conditioning on class labels. Second, we turn to training-time alignment under verifiable reward functions. We introduce *Group Diffusion Policy Optimization (GDPO)*, a reinforcement learning algorithm engineered specifically for the transition dynamics of diffusion language models. Across a diverse set of benchmarks, we demonstrate how these methods bridge the gap between flexible diffusion generation and precise intent alignment.
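As background for the talk, the core idea of classifier-free guidance can be sketched at the level of per-token categorical logits: combine a conditional and an unconditional prediction, extrapolating past the conditional one when the guidance weight exceeds 1. This is a generic illustration of CFG, not the paper's specific discrete-diffusion formulation; all names and values here are invented for the example.

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, w):
    """Classifier-free guidance on categorical logits (generic sketch).

    w = 0 recovers the unconditional model, w = 1 the conditional one,
    and w > 1 extrapolates to strengthen the conditioning signal.
    """
    return uncond_logits + w * (cond_logits - uncond_logits)

def softmax(x):
    # numerically stable softmax over the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# toy example: a vocabulary of 4 tokens at a single position
cond = np.array([2.0, 0.5, 0.1, -1.0])    # conditioned on a class label
uncond = np.array([1.0, 1.0, 0.0, 0.0])   # unconditional prediction
guided = cfg_logits(cond, uncond, w=2.0)
p = softmax(guided)                        # guided sampling distribution
```

In a discrete diffusion sampler, a distribution like `p` would be used at each denoising step; how to do this correctly for discrete state spaces is precisely the question the first paper formalizes.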
This presentation is based on our recently accepted works at ICLR 2026.
Guidance: https://arxiv.org/pdf/2507.08965
GDPO: https://arxiv.org/pdf/2510.08554
Provable Long-Range Benefits of Next-Token Prediction (02/06/2026)
Presenter: Xinyuan Cao
Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length that examines only the next k tokens, for any k, can distinguish k consecutive tokens of such a document from k tokens generated by the learned language model given the same prefix. We provide polynomial bounds (in k, independent of the document length) on the model size needed to achieve such k-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.
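The k-token indistinguishability guarantee described above can be written schematically as follows. The notation is ours, not the paper's: $\mathcal{D}$ is the training distribution, $q$ the learned model, $s$ the bound on the distinguisher's description length, and $\varepsilon$ the distinguishing advantage.

```latex
% Schematic statement of k-token indistinguishability (our notation).
% For every distinguisher D of description length at most s, its acceptance
% probability on k real continuation tokens and on k model-generated tokens
% (following the same prefix x_{1:t}) differ by at most epsilon:
\forall\, D,\ |D| \le s:\quad
\Bigl|\,
\Pr_{x \sim \mathcal{D}}\bigl[D\bigl(x_{t+1:t+k} \mid x_{1:t}\bigr) = 1\bigr]
\;-\;
\Pr_{\hat{x} \sim q(\cdot \mid x_{1:t})}\bigl[D\bigl(\hat{x}_{1:k} \mid x_{1:t}\bigr) = 1\bigr]
\,\Bigr| \;\le\; \varepsilon
```

The abstract's claim is that a model size polynomial in k (and independent of document length) suffices to make this hold.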
Links: