Rethinking Sequence Modeling: LLM Scaling Laws, Expressivity-Efficiency Tradeoffs, and the Role of Architecture (04/03/2026)
Presenter: Jiecheng Lu
Empirical scaling laws show that model performance improves predictably with more data and compute, but these laws are not architecture-agnostic. The architecture determines which scaling curve a model follows, and structural constraints in the attention mechanism can limit how much benefit scaling alone can deliver. In particular, standard attention relies on convex token mixing, channel-synchronized readout, and a fixed positional basis in score space: design choices that favor stability and efficiency, but restrict expressivity, especially in long-context or algorithmic settings. In this talk, we present a unified architectural perspective on how to lift these constraints and move sequence models onto more expressive scaling curves. We discuss three complementary directions: ZeroS, which enables stable signed token mixing and improves linear-time attention; the Free Energy Mixer, which replaces expectation-style reads with value-aware, channel-wise free-energy selection without changing asymptotic complexity; and HyperMLP, which reinterprets attention heads as dynamic MLPs with learnable sequence-space mixing aligned with autoregressive semantics. Together, these designs illustrate how modest architectural changes can yield substantial capability gains at fixed model size and compute. We conclude by discussing implications for language, vision, multivariate time series, and multimodal modeling, and highlight opportunities for downstream tasks in domains where expressive, efficient, and robust sequence modeling is critical.
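The contrast between convex and signed token mixing can be illustrated with a toy sketch. Softmax attention weights are nonnegative and sum to one, so each output is a convex combination of the value vectors; a signed scheme can subtract value directions. This is an illustration of the general idea only, not the ZeroS mechanism itself (the `tanh` weighting and magnitude normalization here are our own assumptions):

```python
import numpy as np

def softmax_mix(scores, values):
    # Standard attention readout: softmax weights are nonnegative and
    # sum to 1, so the output lies in the convex hull of the values.
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values

def signed_mix(scores, values):
    # Illustrative signed mixing: weights may be negative, so the output
    # can leave the convex hull of the values (e.g., cancel a direction).
    w = np.tanh(scores)          # signed, bounded weights
    w /= np.abs(w).sum() + 1e-8  # normalize total magnitude for stability
    return w @ values

scores = np.array([2.0, -1.0, 0.5])
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(softmax_mix(scores, values))  # every coordinate stays within [0, 1]
print(signed_mix(scores, values))   # can produce negative coordinates
```

With these inputs, the signed variant yields a negative second coordinate, which no convex combination of the given values can produce.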
Steering Discrete Diffusion across Training and Inference (02/20/2026)
Presenter: Kevin Rojas
Discrete diffusion models have emerged as a powerful framework for generative modeling of discrete data. They enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, controlling their outputs to satisfy desired conditions remains challenging. In this talk, we explore two complementary strategies for "steering" these models toward desired properties: one at training time and one at inference time.
First, we address inference-time control by formalizing how to properly extend *Classifier-Free Guidance (CFG)* to discrete domains, enabling robust conditioning on class labels. Second, we transition to training-time alignment under verifiable reward functions. We introduce *Group Diffusion Policy Optimization (GDPO)*, a reinforcement learning algorithm specifically engineered for the unique transition dynamics of diffusion language models. Through a series of diverse benchmarks, we demonstrate how these methods bridge the gap between flexible diffusion generation and precise intent alignment.
This presentation is based on our recently accepted works at ICLR 2026.
Guidance: https://arxiv.org/pdf/2507.08965
GDPO: https://arxiv.org/pdf/2510.08554
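As background for the inference-time part of the talk, classifier-free guidance in continuous diffusion extrapolates between conditional and unconditional predictions. A natural discrete analogue applies the same log-space combination to the denoiser's per-token logits and renormalizes over the vocabulary. The sketch below shows that standard combination; it is our own minimal illustration and not necessarily the exact formulation developed in the guidance paper linked above:

```python
import numpy as np

def cfg_logits(logits_cond, logits_uncond, w):
    # Standard CFG combination in log space: extrapolate from the
    # unconditional toward the conditional prediction with strength w.
    return (1.0 + w) * logits_cond - w * logits_uncond

def guided_probs(logits_cond, logits_uncond, w):
    # Renormalize over the vocabulary to get a valid categorical
    # distribution for the next denoising step.
    g = cfg_logits(logits_cond, logits_uncond, w)
    g = g - g.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(g)
    return p / p.sum(axis=-1, keepdims=True)

# Toy 3-token vocabulary: guidance sharpens toward the conditional mode.
logits_cond = np.log(np.array([0.6, 0.3, 0.1]))
logits_uncond = np.log(np.array([1 / 3, 1 / 3, 1 / 3]))
print(guided_probs(logits_cond, logits_uncond, w=2.0))
```

Setting `w = 0` recovers the purely conditional distribution; larger `w` concentrates mass on tokens the conditional model prefers over the unconditional one.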
Provable Long-Range Benefits of Next-Token Prediction (02/06/2026)
Presenter: Xinyuan Cao
Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution and for any k, no algorithm of bounded description length that examines only the next k tokens can distinguish k consecutive tokens of such documents from k tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in k, independent of the document length) on the model size needed to achieve such k-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.
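The indistinguishability guarantee can be stated schematically as follows (the notation is ours, not necessarily the paper's): for a held-out prefix $x_{<t}$ from the training distribution $\mathcal{D}$, a true continuation $y \sim \mathcal{D}(\cdot \mid x_{<t})$, and a model continuation $\hat{y} \sim \hat{\mathcal{D}}(\cdot \mid x_{<t})$, every distinguisher $A$ of bounded description length that reads only $k$ tokens satisfies

```latex
\left| \Pr\!\left[ A\!\left(x_{<t},\, y_{1:k}\right) = 1 \right]
     - \Pr\!\left[ A\!\left(x_{<t},\, \hat{y}_{1:k}\right) = 1 \right] \right|
\le \varepsilon ,
```

with the required model size polynomial in $k$ (and independent of the document length), matching the bounds described in the abstract.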
Links: