Winter 2026

February 24th 2026, 6 pm UTC

Title: Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration

Paper: https://arxiv.org/abs/2503.07453

Slides: link

february 17th 2026, 6 pm UTC

Speaker: Aldo Pacchiano (Boston University)

Title: On the Hardness of Bandit Learning

Paper: https://arxiv.org/abs/2506.14746

Slides: link

More Details:

Authors: Nataly Brukhim, Aldo Pacchiano, Miroslav Dudik, Robert Schapire

Abstract: Bandit learning (best-arm identification) with structured reward classes is often viewed as an interactive analogue of classical learnability, inviting the hope for clean characterizations and generic algorithmic principles. In this talk, I’ll present evidence that this analogy breaks in fundamental ways: (i) no “VC-dimension–like” combinatorial quantity with finite character can characterize bandit learnability, even for finite classes, and (ii) computational hardness can be inherent to bandit learning, via a construction where the optimal action is identifiable with two queries yet no polynomial-time algorithm can achieve this unless RP = NP even though the same class admits efficient ERM-style consistency and other standard primitives. I’ll also discuss how observation noise reshapes query complexity, including sharp separations between noise-free and noisy regimes, and a final separation showing that query-optimal identification and regret-optimal learning can be incompatible objectives.

Speaker Bio: Aldo Pacchiano’s research revolves around the statistical and algorithmic foundations that let AI systems learn and adapt as the world around them changes. He is an Assistant Professor in Boston University’s Center for Computing & Data Sciences, and a visiting scientist at the Eric and Wendy Schmidt center at the Broad Institute where he develops methods for sequential decision making that allow computers to discover new knowledge with minimal human guidance. His research agenda is two-pronged: advancing the theory that explains how learning unfolds in dynamic environments and translating those insights into high-impact applications ranging from autonomous scientific discovery to the human-centric design of AI systems.

February 10th 2026, 6 pm UTC

Speaker: Alena Shilova (INRIA)

Title: StaQ it! Growing neural networks for Policy Mirror Descent

Paper: https://arxiv.org/abs/2506.13862

Slides: link

More Details:

Authors: Alena Shilova, Alex Davey, Brahim Driss, Riad Akrour

Abstract: In Reinforcement Learning (RL), regularization has emerged as a popular tool both in theory and practice, typically based either on an entropy bonus or a Kullback-Leibler divergence that constrains successive policies. In practice, these approaches have been shown to improve exploration, robustness and stability, giving rise to popular Deep RL algorithms such as SAC and TRPO. Policy Mirror Descent (PMD) is a theoretical framework that solves this general regularized policy optimization problem, however the closed-form solution involves the sum of all past Q-functions, which is intractable in practice. We propose and analyze PMD-like algorithms that only keep the last M Q-functions in memory, and show that for finite and large enough M, a convergent algorithm can be derived, introducing no error in the policy update, unlike prior deep RL PMD implementations. StaQ, the resulting algorithm, enjoys strong theoretical guarantees and is competitive with deep RL baselines, while exhibiting less performance oscillation, paving the way for fully stable deep RL algorithms and providing a testbed for experimentation with Policy Mirror Descent.

Speaker Bio: Alena Shilova is a researcher at Inria Saclay (Paris) (TAU team) working at the intersection of RL and scientific machine learning. More recently, Alena have started exploring growing neural network architectures with/for reinforcement learning, motivated by the challenge of building adaptive agents that can adjust their model capacity over time, particularly in non-stationary environments. Previously, Alena was a postdoctoral researcher at Inria Lille (Scool) and completed my PhD at Inria Bordeaux on efficient training of neural networks.

January 20th 2026, 6 pm UTC

Speaker: Uri Sherman (Tel Aviv University)

Title: Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning

Paper: https://arxiv.org/abs/2507.04406

Slides: link

More Details:

Authors: Uri Sherman, Tomer Koren, Yishay Mansour

Abstract: It is well known that efficient policy learning is possible in the function approximation setup subject to policy class completeness and suitable coverage conditions, in particular, without making any additional assumptions on the structure of the environment. In this talk, I will discuss our recent results that demonstrate efficient policy learning is possible under a weaker assumption originating in optimization literature, namely, variational gradient dominance (VGD) of the value function. I will present a general policy learning framework that reduces the reinforcement learning problem to first-order optimization in non-Euclidean space, and demonstrate how this leads to new algorithms as well as sheds new light on the convergence properties of existing ones. Specifically, we will discuss a novel Steepest Descent Policy Optimization method, the well-known Policy Mirror Descent algorithm, and the classical Conservative Policy Iteration (Kakade and Langford, 2002) algorithm, which we revisit through the lens of the Frank Wolfe method. Time permitting, we will conclude with a discussion of the practical relevance of the VGD condition and the algorithms considered.

Speaker Bio: Uri is a fifth-year PhD student at Tel-Aviv University, advised by Yishay Mansour and Tomer Koren. Prior to that, Uri spent a few years in various engineering and management positions in the private sector, and before that obtained his B.Sc. from Tel Aviv University and M.Sc. from the Weizmann Institute of Science, where he worked with Prof. Uriel Feige. Uri's research interests are in stochastic optimization and reinforcement learning theory.

december 16th 2025, 6 pm UTC

Speaker: Jongha Ryu (MIT)

Title: Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing

Paper: https://arxiv.org/abs/2502.10826

Slides: link

More Details:

Authors: Jongha Ryu, Jeongyeol Kwon, Benjamin Koppe, Kwang-Sung Jun

Abstract: I will present two improved algorithms for off-policy selection and learning in contextual bandits, where the goal is to identify a reward-maximizing policy using offline data collected by a fixed behavior policy.

For off-policy selection, I will introduce a pessimism-based algorithm that uses a new lower confidence bound derived from a betting framework with Cover’s universal portfolio. The resulting guarantee is variance-adaptive and strictly sharper than prior work.
For off-policy learning, I will describe a general condition on optimization objectives that reveals a different bias-variance tradeoff. A special instance, freezing, sharply reduces variance in small-data regimes while matching the best existing guarantees.

I will present experiments showing that the new selection method consistently outperforms previous approaches and that freezing yields substantial gains when data are scarce.

Speaker Bio: Jongha (Jon) Ryu is a postdoctoral associate at MIT EECS. He received his Ph.D. in Electrical and Computer Engineering from UC San Diego. His research focuses on statistical and mathematical foundations of scientific machine learning, with recent work on neural spectral methods, score-based generative modeling, and off-policy or sequential decision-making problems.

december 9th 2025, 6 pm UTC

Speaker: Nived Rajaraman (UC Berkeley)

Title: Beyond UCB: The curious case of non-linear ridge bandits

Paper: https://arxiv.org/abs/2302.06025

Slides: link

More Details:

Authors: Nived Rajaraman, Yanjun Han, Jiantao Jiao, and Kannan Ramchandran

Abstract: There is a large volume of work on bandits and reinforcement learning when the reward/value function satisfies some form of linearity. But what happens if the reward is non-linear? In this talk, I will present some of our work on the setting where the non-linear reward function takes the form of a ridge function / single index model. Two curious phenomena arise for ridge bandits: first, in addition to the "learning phase" with a standard regret, there is an "initialization phase" with a fixed sample cost determined by the nature of the reward function; second, achieving the smallest sample cost in the initialization phase requires new learning algorithms beyond traditional ones such as UCB. For a special family of non-linear bandits taking the form of a “ridge" function for a non-linear monotone function, we derive upper and lower bounds on the optimal fixed cost of learning, and in addition, on the entire “learning trajectory” via differential equations. Our work proposes a two-stage exploration algorithm which first finds a good initialization, and subsequently exploits local linearity in the learning phase. In several important special cases (convex $f$, and concave $f$), we prove that the optimal guarantees are also achieved by a very simple algorithm, an interactive variant of stochastic gradient descent. In contrast, several classical and celebrated algorithms, such as UCB and algorithms relying on online/offline regression oracles, are proven to be suboptimal.

Speaker Bio: Nived Rajaraman is currently a postdoctoral researcher at Microsoft Research NYC. He recently finished his PhD at Berkeley, advised by Jiantao Jiao and Kannan Ramchandran. His research interests lie in the intersection of sequential decision making and large language models. He was named a CPAL Rising Star for 2025.

NOvember 25th 2025, 6 PM UTC

Speaker: Dorian Baudry (INRIA)

Title: Does Stochastic Gradient really succeed for bandits?

Paper: https://openreview.net/pdf?id=gL4muAFwsh

Slides: link

More Details:

Authors: Dorian Baudry, Emmeran Johnson, Simon Vary, Ciara Pike-Burke, Patrick Rebeschini

Abstract: Recent works have deepened the theoretical understanding of the Stochastic Gradient Bandit (SGB) policy, showing that using a constant learning rate guarantees asymptotic convergence to the optimal policy, and that sufficiently small learning rates can yield logarithmic regret. However, whether logarithmic regret holds beyond small learning rates remains unclear. In this work, we take a step towards characterizing the regret regimes of SGB as a function of its learning rate. For two–armed bandits, we identify a sharp threshold, below which SGB achieves logarithmic regret on all instances, and above which it can incur polynomial regret on some instances. This result highlights the necessity of knowing (or estimating) the minimum gap to ensure logarithmic regret with a constant learning rate. For general K-armed bandits, we further show the learning rate must additionally scale inversely with K to avoid polynomial regret. We introduce novel techniques to derive regret upper bounds for SGB, laying the groundwork for future advances in the theory of gradient-based bandit algorithms.

Speaker Bio: I am a researcher at Inria Grenoble & Université Grenoble Alpes, working on the theory of sequential decision-making problems, in particular multi-armed bandits. This work was conducted while I was a postdoctoral researcher at the statistics department of Oxford University, where I was working with Patrick Rebeschini. Previously, I had a first experience as a postdoctoral researcher at ENSAE & IP Paris, where I was working with Vianney Perchet; and I received my PhD from Université de Lille in 2022, where I was supervised by Emilie Kaufmann and Odalric-Ambrym Maillard.

Page updated

Google Sites

Report abuse