Winter 2026

June 23rd 2026, 4 pm UTC

Speaker: Andrew Wagenmaker (UC Berkeley)

Title: Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning

Slides: link

More Details:

Authors: Andrew Wagenmaker, Perry Dong, Raymond Tsao, Chelsea Finn, Sergey Levine

Abstract: Standard practice across domains from robotics to language is to first pretrain a policy on a large-scale demonstration dataset, and then finetune this policy, typically with reinforcement learning (RL), in order to improve performance on deployment domains. This finetuning step has proved critical in achieving human or super-human performance, yet while much attention has been given to developing more effective finetuning algorithms, little attention has been given to ensuring the pretrained policy is an effective initialization for RL finetuning. In this work we seek to understand how the pretrained policy affects finetuning performance, and how to pretrain policies in order to ensure they are effective initializations for finetuning. We first show theoretically that standard behavioral cloning (BC) -- which trains a policy to directly match the actions played by the demonstrator -- can fail to ensure coverage over the demonstrator's actions, a minimal condition necessary for effective RL finetuning. We then show that if, instead of exactly fitting the observed demonstrations, we train a policy to model the posterior distribution of the demonstrator's behavior given the demonstration dataset, we do obtain a policy that ensures coverage over the demonstrator's actions, enabling more effective finetuning. Furthermore, this policy -- which we refer to as the posterior behavioral cloning (PostBC) policy -- achieves this while ensuring pretrained performance is no worse than that of the BC policy. We then show that PostBC is practically implementable with modern generative models in robotic control domains -- relying only on standard supervised learning -- and leads to significantly improved RL finetuning performance on both realistic robotic control benchmarks and real-world robotic manipulation tasks, as compared to standard behavioral cloning.
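As a rough illustration of the coverage issue described in the abstract, the sketch below compares two policies fit at a single state with a small discrete action set. It is a toy tabular example, not the PostBC method from the paper: the symmetric Dirichlet prior, the `alpha` parameter, and the function names are all assumptions made for illustration. The maximum-likelihood BC fit gives zero probability to any action the demonstrator happened not to play in the dataset, while the posterior-mean fit keeps every action in its support.

```python
from collections import Counter


def bc_policy(actions_seen, num_actions):
    """Maximum-likelihood BC at one state: empirical action frequencies."""
    counts = Counter(actions_seen)
    n = len(actions_seen)
    return [counts[a] / n for a in range(num_actions)]


def posterior_mean_policy(actions_seen, num_actions, alpha=1.0):
    """Posterior-mean policy under a symmetric Dirichlet(alpha) prior.

    The prior is an assumption of this sketch, chosen because it has a
    closed-form posterior for categorical data.
    """
    counts = Counter(actions_seen)
    n = len(actions_seen)
    return [(counts[a] + alpha) / (n + alpha * num_actions)
            for a in range(num_actions)]


# Hypothetical demonstrations at one state with three possible actions.
demos = [0, 0, 0, 1]
bc = bc_policy(demos, 3)
post = posterior_mean_policy(demos, 3)

print("BC policy:       ", bc)
print("Posterior policy:", post)
```

Here action 2 never appears in the demonstrations, so the BC fit can never sample it during finetuning, whereas the posterior-mean fit still assigns it positive probability.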

Speaker Bio: Andrew Wagenmaker is a postdoctoral scholar in Electrical Engineering and Computer Sciences at UC Berkeley working with Sergey Levine. Previously, he completed a PhD in Computer Science at the University of Washington, where he was advised by Kevin Jamieson. Andrew’s research focuses on learning in dynamic, interactive settings, spanning fundamental algorithm development to practical approaches for real-world learning and decision-making, particularly toward enabling efficient learning in robotic systems. His work has been recognized by a Best Paper nomination at the Conference on Robot Learning, and he is a recipient of the NSF Graduate Research Fellowship.

June 16th 2026, 4 pm UTC

Speaker: Noah Golowich (Microsoft Research)

Title: Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference

Paper: https://arxiv.org/abs/2603.07887

Slides: link

More Details:

Authors: Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster, Akshay Krishnamurthy

Abstract: Efficiently sampling from a complex probability distribution is a fundamental problem across machine learning and theoretical computer science. It has become increasingly pertinent in recent years with the rise of generative AI, as sophisticated sampling procedures from large language models (LLMs) have been proposed to solve challenging reasoning problems spanning domains such as mathematics and coding. For the most part, however, we lack a principled understanding of the accuracy--cost tradeoffs for such procedures. In this talk, we propose a formalization for such tasks as the problem of producing a sample from a target probability measure, given an oracle which yields approximate density estimates for the target measure. Depending on the context, this oracle may be interpreted as an approximate verifier or a *process reward model* for a particular language modeling task. This setup is closely related to the problem of reducing sampling to approximate counting studied in seminal works of Jerrum, Valiant & Vazirani (1986) and Jerrum & Sinclair (1989).

Generalizing results from existing literature, we establish provable guarantees for the Sequential Monte Carlo algorithm and related particle filtering approaches, which have recently found success empirically in the context of both language modeling and diffusion. In particular, our theory identifies a few properties of the oracle which suffice for efficient sampling. We conduct experiments to show that these properties indeed correlate with sampling performance for certain language modeling tasks.

The efficacy of such sampling algorithms, however, is limited by the relationship between the underlying LLM and the particular sampling task at hand, which has motivated the framework of Test-Time Training (TTT). In particular, TTT updates a model's weights in response to partial generations and reward feedback received at inference time. In the latter half of the talk, we will discuss some provable benefits of TTT in the context of our sampling framework.

Based on https://arxiv.org/pdf/2603.07887 (joint work with Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster, and Akshay Krishnamurthy); and https://arxiv.org/pdf/2606.11437 (joint work with Ankur Moitra and Dhruv Rohatgi).
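For readers unfamiliar with the Sequential Monte Carlo approach mentioned in the abstract, the following is a minimal sketch of the generic propose, weight, and resample loop over token prefixes. It is not the algorithm analyzed in the paper: the uniform proposal (standing in for a language model), multinomial resampling at every step, the toy oracle, and all function names are assumptions made for illustration. The oracle is assumed to return a strictly positive density estimate for any prefix.

```python
import random


def smc_sample(oracle, vocab, length, num_particles, rng):
    """Generic SMC over token sequences, guided by a prefix-density oracle."""
    particles = [()] * num_particles
    for _ in range(length):
        # Propose: extend each prefix with a uniformly random token.
        extended = [p + (rng.choice(vocab),) for p in particles]
        # Weight: ratio of oracle estimates for the new and old prefixes.
        # The constant uniform-proposal probability cancels on normalization.
        weights = [oracle(new) / oracle(old)
                   for new, old in zip(extended, particles)]
        # Resample: draw particles in proportion to their weights.
        particles = rng.choices(extended, weights=weights, k=num_particles)
    return particles


def toy_oracle(prefix):
    """Hypothetical oracle: density proportional to 3 ** (number of 1 tokens)."""
    return 3.0 ** sum(prefix)


rng = random.Random(0)
samples = smc_sample(toy_oracle, (0, 1), length=5, num_particles=2000, rng=rng)
fraction_of_ones = sum(sum(s) for s in samples) / (len(samples) * 5)
print("fraction of 1 tokens:", fraction_of_ones)
```

Under this toy target each token is 1 with probability 3/4, so the printed fraction should land near 0.75, compared with 0.5 for the unguided uniform proposal.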

Speaker Bio: Noah Golowich is a postdoctoral researcher at Microsoft Research, NYC. In 2026, he will join the computer science department at UT Austin as an Assistant Professor. He completed his PhD at MIT, where he was advised by Constantinos Daskalakis and Ankur Moitra. He was a recipient of the 2025 AAAI/ACM SIGAI Doctoral Dissertation Award, the 2025 SIGecom Doctoral Dissertation Award, and the 2026 EATCS Doctoral Dissertation Award. His research focuses broadly on the theoretical foundations of modern AI. He is particularly interested in the role that computational constraints play in shaping our current and future toolkit of algorithms for machine learning and AI.

June 9th 2026, 4 pm UTC

Speaker: Zakaria Mhammedi (Google Research)

Title: Decoupling Exploration and Policy Optimization: Uncertainty-Guided Tree Search for Hard Exploration

Paper: https://arxiv.org/abs/2603.22273

Slides:

More Details:

Authors: Zakaria Mhammedi, James Cohan

Abstract: The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new approach that explicitly decouples exploration from policy optimization and bypasses RL entirely during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard exploration benchmarks. Further, we demonstrate that the trajectories discovered during exploration can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art performance by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before for the Adroit tasks.

Speaker Bio: Zak Mhammedi is a Research Scientist at Google Research, focusing on reinforcement learning and optimization. He completed his PhD in Computer Science at the Australian National University and previously held a postdoctoral position at MIT. Zak’s work bridges the gap between theoretical and practical AI, particularly in developing efficient reinforcement learning algorithms. He has presented at top conferences such as COLT, NeurIPS, and ICML, with several papers receiving oral and spotlight recognition.

June 2nd 2026, 4 pm UTC

Speaker: Daniel Russo (Columbia University)

Title: Success Conditioning as Policy Improvement: The Optimization Problem Solved by Imitating Success

Paper: https://arxiv.org/abs/2601.18175

Slides: link

More Details:

Authors: Daniel Russo

Abstract: A widely used technique for improving policies is success conditioning, in which one collects trajectories, identifies those that achieve a desired outcome, and updates the policy to imitate the actions taken along successful trajectories. This principle appears under many names -- rejection sampling with SFT, goal-conditioned RL, Decision Transformers -- yet what optimization problem it solves, if any, has remained unclear. We prove that success conditioning exactly solves a trust-region optimization problem, maximizing policy improvement subject to a χ² divergence constraint whose radius is determined automatically by the data. This yields an identity: relative policy improvement, the magnitude of policy change, and a quantity we call action-influence -- measuring how random variation in action choices affects success rates -- are exactly equal at every state. Success conditioning thus emerges as a conservative improvement operator. Exact success conditioning cannot degrade performance or induce a dangerous distribution shift, but when it fails, it does so observably, by hardly changing the policy at all. We apply our theory to the common practice of return thresholding, showing that this can amplify improvement, but at the cost of potential misalignment with the true objective.
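The flavor of the identity in the abstract can be checked numerically in the simplest possible case, a single state with one action per episode. The sketch below is an illustration under that assumption only, not the paper's general multi-state result; the policy, the success probabilities, and the function name are hypothetical. Conditioning on success reweights each action by its success probability, and in this one-step case the relative improvement in success rate equals the χ² divergence of the conditioned policy from the original one.

```python
def success_condition(policy, success_prob):
    """Condition a single-state policy on success.

    The new probability of action a is proportional to
    policy[a] * success_prob[a].
    """
    value = sum(p * q for p, q in zip(policy, success_prob))
    pi_plus = [p * q / value for p, q in zip(policy, success_prob)]
    return pi_plus, value


# Hypothetical three-action example.
policy = [0.5, 0.3, 0.2]        # original action probabilities
success_prob = [0.2, 0.5, 0.9]  # P(success | action)

pi_plus, value = success_condition(policy, success_prob)
new_value = sum(p * q for p, q in zip(pi_plus, success_prob))

# Relative policy improvement in success rate.
relative_improvement = (new_value - value) / value
# Chi-squared divergence of the conditioned policy from the original.
chi2 = sum((pp - p) ** 2 / p for pp, p in zip(pi_plus, policy))

print("relative improvement:", relative_improvement)
print("chi-squared divergence:", chi2)
```

If every action had the same success probability, the conditioned policy would equal the original and both quantities would be zero, which matches the abstract's remark that failure shows up as the policy hardly changing at all.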

Speaker Bio: Daniel Russo is an associate professor in the Decision, Risk, and Operations division of Columbia Business School. He completed his undergraduate studies in math and economics at the University of Michigan, doctoral studies at Stanford University under the supervision of Benjamin Van Roy, and worked as a postdoctoral researcher at Microsoft Research New England. His research has been recognized by several awards in the operations research community: the George Nicholson Prize (best paper by a PhD student), the JFIG Paper Award (best paper by a junior faculty member), the Frederick W. Lanchester Prize (best contribution to operations research in the past five years), and the Erlang Prize (early career award for contributions to applied probability). He currently serves as an associate editor at Management Science, Operations Research, and Stochastic Systems.
