Speaker: Yuda Song (CMU)
Title: To Distill or Decide? Understanding the Algorithmic Trade-Off in Partially Observable Reinforcement Learning
Paper: https://arxiv.org/abs/2510.03207
Slides: link
More Details:
Authors: Yuda Song, Dhruv Rohatgi, Aarti Singh, J. Andrew Bagnell
Abstract: Partial observability remains one of the central challenges in reinforcement learning, requiring agents to learn complex, history-dependent policies. A popular practical strategy—privileged expert distillation—sidesteps much of this difficulty by exploiting latent state information available during training (e.g., from a simulator) to first learn an optimal Markovian policy and then imitate it, effectively disentangling "learning to see" from "learning to act." While computationally attractive, this approach has well-documented failure modes that are not yet fully understood.
In this talk, I will investigate the algorithmic trade-off between privileged expert distillation and standard RL without privileged information. I will introduce the perturbed Block MDP, a simple but instructive theoretical model that makes precise predictions about when and why each approach succeeds or fails. I will then present controlled experiments on challenging simulated locomotion tasks that validate these predictions. Our main findings are twofold. First, I will show that the trade-off empirically hinges on the stochasticity of the latent dynamics, as predicted by contrasting approximate decodability with belief contraction in the perturbed Block MDP. Second, I will demonstrate a perhaps surprising result: the optimal latent policy is not always the best latent policy to distill. I will conclude by discussing new practical guidelines for effectively exploiting privileged information, with implications for policy learning across a range of partially observable domains.
Speaker Bio: Yuda Song is a fourth-year PhD candidate in the Machine Learning Department at Carnegie Mellon University, advised by Aarti Singh and Drew Bagnell. His research focuses on the theory of interactive decision-making, with emphasis on efficient and robust learning in reinforcement learning and applications on LLM post-training and robotics. His research is partially supported by the Two Sigma PhD Fellowship.
Speaker: Bastien Dubail (University of Bristol)
Title: Shift Before You Learn: Enabling Low-Rank Representations in Reinforcement Learning
Paper: https://arxiv.org/abs/2509.05193
Slides: link
More Details:
Authors: Bastien Dubail, Stefan Stojanovic, Alexandre Proutière
Abstract: Low-rank structure is a common implicit assumption in many reinforcement learning algorithms. In this talk, I will present novel insights into how such structures arise naturally from the dynamics of the environment and how they can be exploited. I will focus on the problem of estimating the successor measure, a matrix representation that encodes the future state occupancies. Successor representations have attracted a lot of attention among deep RL practitioners and have been claimed to exhibit better low-rank structures than, say, transition matrices. Our work attempts to understand this and eventually challenges this claim.
Our key idea is to connect approximate low-rankness with mixing phenomena in Markov chains: since successor measures capture long-range dynamics, mixing or convergence of long trajectories should induce approximate low-rank structure. We formalize this intuition and analyze its implications for the estimation of successor measures. Interestingly, we find that these operators still retain too much of the short-term dynamics to be effectively low-rank.
To address this issue, we introduce shifted successor measures, which skip a few initial transitions and thus focus more directly on the long-term dynamics. We show that these operators exhibit stronger low-rank properties and provide the first sample-complexity guarantees for their estimation. I will conclude with numerical results showing improved performance in goal-conditioned RL.
Speaker Bio: Bastien Dubail is a postdoctoral research associate at the University of Bristol, working with Anthony Lee and Christophe Andrieu. He previously held a postdoctoral position at KTH in the team of Alexandre Proutière and earned his PhD from INRIA Paris and Aix-Marseille Université under the supervision of Laurent Massoulié and Charles Bordenave. His research interests span the theory of stochastic processes, Markov chains, and reinforcement learning.
Speaker: Dhruv Rohatgi (MIT)
Title: Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration
Paper: https://arxiv.org/abs/2503.07453
Slides: link
More Details:
Authors: Dylan J. Foster, Zakaria Mhammedi, Dhruv Rohatgi
Abstract: Language model alignment (or, reinforcement learning) techniques that leverage active exploration -- deliberately encouraging the model to produce diverse, informative responses -- offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration:
We introduce a new algorithm, SpannerSampling, which obtains optimal data-efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient *coverage*. Coverage refers to the extent to which the pre-trained model covers near-optimal responses -- a form of hidden knowledge. SpannerSampling improves upon the computational efficiency of existing active-exploration algorithms by leveraging *inference-time computation* with the pre-trained model to reduce the effective search space for exploration.
We contrast the result above by showing that *training-time interventions* (e.g., exploratory modifications to DPO) that produce proper policies cannot achieve similar guarantees in polynomial time: they must sacrifice either data-efficiency or computational efficiency.
We show that coverage is necessary for computational efficiency in our framework, matching our upper bound. However, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through *multi-turn exploration*. En route, we show that any MDP where the optimal KL-regularized value function is linear (linear-Q*_β) is learnable in the reset access model.
We view these results as a step toward a computational theory of decision making with generative models.
Speaker Bio: Dhruv Rohatgi is a final-year PhD student at the Massachusetts Institute of Technology, advised by Ankur Moitra. His research focuses on understanding how fundamental building blocks of machine learning can be fit together to solve more complex problems, particularly those that involve reliably making sequences of decisions.
Speaker: Aldo Pacchiano (Boston University)
Title: On the Hardness of Bandit Learning
Paper: https://arxiv.org/abs/2506.14746
Slides: link
More Details:
Authors: Nataly Brukhim, Aldo Pacchiano, Miroslav Dudik, Robert Schapire
Abstract: Bandit learning (best-arm identification) with structured reward classes is often viewed as an interactive analogue of classical learnability, inviting the hope for clean characterizations and generic algorithmic principles. In this talk, I’ll present evidence that this analogy breaks in fundamental ways: (i) no “VC-dimension–like” combinatorial quantity with finite character can characterize bandit learnability, even for finite classes, and (ii) computational hardness can be inherent to bandit learning, via a construction where the optimal action is identifiable with two queries yet no polynomial-time algorithm can achieve this unless RP = NP even though the same class admits efficient ERM-style consistency and other standard primitives. I’ll also discuss how observation noise reshapes query complexity, including sharp separations between noise-free and noisy regimes, and a final separation showing that query-optimal identification and regret-optimal learning can be incompatible objectives.
Speaker Bio: Aldo Pacchiano’s research revolves around the statistical and algorithmic foundations that let AI systems learn and adapt as the world around them changes. He is an Assistant Professor in Boston University’s Center for Computing & Data Sciences, and a visiting scientist at the Eric and Wendy Schmidt center at the Broad Institute where he develops methods for sequential decision making that allow computers to discover new knowledge with minimal human guidance. His research agenda is two-pronged: advancing the theory that explains how learning unfolds in dynamic environments and translating those insights into high-impact applications ranging from autonomous scientific discovery to the human-centric design of AI systems.
Speaker: Alena Shilova (INRIA)
Title: StaQ it! Growing neural networks for Policy Mirror Descent
Paper: https://arxiv.org/abs/2506.13862
Slides: link
More Details:
Authors: Alena Shilova, Alex Davey, Brahim Driss, Riad Akrour
Abstract: In Reinforcement Learning (RL), regularization has emerged as a popular tool both in theory and practice, typically based either on an entropy bonus or a Kullback-Leibler divergence that constrains successive policies. In practice, these approaches have been shown to improve exploration, robustness and stability, giving rise to popular Deep RL algorithms such as SAC and TRPO. Policy Mirror Descent (PMD) is a theoretical framework that solves this general regularized policy optimization problem, however the closed-form solution involves the sum of all past Q-functions, which is intractable in practice. We propose and analyze PMD-like algorithms that only keep the last M Q-functions in memory, and show that for finite and large enough M, a convergent algorithm can be derived, introducing no error in the policy update, unlike prior deep RL PMD implementations. StaQ, the resulting algorithm, enjoys strong theoretical guarantees and is competitive with deep RL baselines, while exhibiting less performance oscillation, paving the way for fully stable deep RL algorithms and providing a testbed for experimentation with Policy Mirror Descent.
Speaker Bio: Alena Shilova is a researcher at Inria Saclay (Paris) (TAU team) working at the intersection of RL and scientific machine learning. More recently, Alena have started exploring growing neural network architectures with/for reinforcement learning, motivated by the challenge of building adaptive agents that can adjust their model capacity over time, particularly in non-stationary environments. Previously, Alena was a postdoctoral researcher at Inria Lille (Scool) and completed my PhD at Inria Bordeaux on efficient training of neural networks.
Speaker: Uri Sherman (Tel Aviv University)
Title: Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning
Paper: https://arxiv.org/abs/2507.04406
Slides: link
More Details:
Authors: Uri Sherman, Tomer Koren, Yishay Mansour
Abstract: It is well known that efficient policy learning is possible in the function approximation setup subject to policy class completeness and suitable coverage conditions, in particular, without making any additional assumptions on the structure of the environment. In this talk, I will discuss our recent results that demonstrate efficient policy learning is possible under a weaker assumption originating in optimization literature, namely, variational gradient dominance (VGD) of the value function. I will present a general policy learning framework that reduces the reinforcement learning problem to first-order optimization in non-Euclidean space, and demonstrate how this leads to new algorithms as well as sheds new light on the convergence properties of existing ones. Specifically, we will discuss a novel Steepest Descent Policy Optimization method, the well-known Policy Mirror Descent algorithm, and the classical Conservative Policy Iteration (Kakade and Langford, 2002) algorithm, which we revisit through the lens of the Frank Wolfe method. Time permitting, we will conclude with a discussion of the practical relevance of the VGD condition and the algorithms considered.
Speaker Bio: Uri is a fifth-year PhD student at Tel-Aviv University, advised by Yishay Mansour and Tomer Koren. Prior to that, Uri spent a few years in various engineering and management positions in the private sector, and before that obtained his B.Sc. from Tel Aviv University and M.Sc. from the Weizmann Institute of Science, where he worked with Prof. Uriel Feige. Uri's research interests are in stochastic optimization and reinforcement learning theory.
Speaker: Jongha Ryu (MIT)
Title: Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing
Paper: https://arxiv.org/abs/2502.10826
Slides: link
More Details:
Authors: Jongha Ryu, Jeongyeol Kwon, Benjamin Koppe, Kwang-Sung Jun
Abstract: I will present two improved algorithms for off-policy selection and learning in contextual bandits, where the goal is to identify a reward-maximizing policy using offline data collected by a fixed behavior policy.
For off-policy selection, I will introduce a pessimism-based algorithm that uses a new lower confidence bound derived from a betting framework with Cover’s universal portfolio. The resulting guarantee is variance-adaptive and strictly sharper than prior work.
For off-policy learning, I will describe a general condition on optimization objectives that reveals a different bias-variance tradeoff. A special instance, freezing, sharply reduces variance in small-data regimes while matching the best existing guarantees.
I will present experiments showing that the new selection method consistently outperforms previous approaches and that freezing yields substantial gains when data are scarce.
Speaker Bio: Jongha (Jon) Ryu is a postdoctoral associate at MIT EECS. He received his Ph.D. in Electrical and Computer Engineering from UC San Diego. His research focuses on statistical and mathematical foundations of scientific machine learning, with recent work on neural spectral methods, score-based generative modeling, and off-policy or sequential decision-making problems.
Speaker: Nived Rajaraman (UC Berkeley)
Title: Beyond UCB: The curious case of non-linear ridge bandits
Paper: https://arxiv.org/abs/2302.06025
Slides: link
More Details:
Authors: Nived Rajaraman, Yanjun Han, Jiantao Jiao, and Kannan Ramchandran
Abstract: There is a large volume of work on bandits and reinforcement learning when the reward/value function satisfies some form of linearity. But what happens if the reward is non-linear? In this talk, I will present some of our work on the setting where the non-linear reward function takes the form of a ridge function / single index model. Two curious phenomena arise for ridge bandits: first, in addition to the "learning phase" with a standard regret, there is an "initialization phase" with a fixed sample cost determined by the nature of the reward function; second, achieving the smallest sample cost in the initialization phase requires new learning algorithms beyond traditional ones such as UCB. For a special family of non-linear bandits taking the form of a “ridge" function for a non-linear monotone function, we derive upper and lower bounds on the optimal fixed cost of learning, and in addition, on the entire “learning trajectory” via differential equations. Our work proposes a two-stage exploration algorithm which first finds a good initialization, and subsequently exploits local linearity in the learning phase. In several important special cases (convex $f$, and concave $f$), we prove that the optimal guarantees are also achieved by a very simple algorithm, an interactive variant of stochastic gradient descent. In contrast, several classical and celebrated algorithms, such as UCB and algorithms relying on online/offline regression oracles, are proven to be suboptimal.
Speaker Bio: Nived Rajaraman is currently a postdoctoral researcher at Microsoft Research NYC. He recently finished his PhD at Berkeley, advised by Jiantao Jiao and Kannan Ramchandran. His research interests lie in the intersection of sequential decision making and large language models. He was named a CPAL Rising Star for 2025.
Speaker: Dorian Baudry (INRIA)
Title: Does Stochastic Gradient really succeed for bandits?
Paper: https://openreview.net/pdf?id=gL4muAFwsh
Slides: link
More Details:
Authors: Dorian Baudry, Emmeran Johnson, Simon Vary, Ciara Pike-Burke, Patrick Rebeschini
Abstract: Recent works have deepened the theoretical understanding of the Stochastic Gradient Bandit (SGB) policy, showing that using a constant learning rate guarantees asymptotic convergence to the optimal policy, and that sufficiently small learning rates can yield logarithmic regret. However, whether logarithmic regret holds beyond small learning rates remains unclear. In this work, we take a step towards characterizing the regret regimes of SGB as a function of its learning rate. For two–armed bandits, we identify a sharp threshold, below which SGB achieves logarithmic regret on all instances, and above which it can incur polynomial regret on some instances. This result highlights the necessity of knowing (or estimating) the minimum gap to ensure logarithmic regret with a constant learning rate. For general K-armed bandits, we further show the learning rate must additionally scale inversely with K to avoid polynomial regret. We introduce novel techniques to derive regret upper bounds for SGB, laying the groundwork for future advances in the theory of gradient-based bandit algorithms.
Speaker Bio: I am a researcher at Inria Grenoble & Université Grenoble Alpes, working on the theory of sequential decision-making problems, in particular multi-armed bandits. This work was conducted while I was a postdoctoral researcher at the statistics department of Oxford University, where I was working with Patrick Rebeschini. Previously, I had a first experience as a postdoctoral researcher at ENSAE & IP Paris, where I was working with Vianney Perchet; and I received my PhD from Université de Lille in 2022, where I was supervised by Emilie Kaufmann and Odalric-Ambrym Maillard.