Speaker: Nneka Okolo (MIT)
Title: Offline RL via Feature-Occupancy Gradient Ascent
Paper: https://arxiv.org/abs/2405.13755
Slides: here
More Details:
Offline RL via Feature-Occupancy Gradient Ascent
Authors: Gergely Neu, Nneka Okolo
Abstract: We study offline Reinforcement Learning in large infinite-horizon discounted Markov Decision Processes (MDPs) when the reward and transition models are linearly realizable under a known feature map. Starting from the classic linear-program formulation of the optimal control problem in MDPs, we develop a new algorithm that performs a form of gradient ascent in the space of feature occupancies, defined as the expected feature vectors that can potentially be generated by executing policies in the environment. We show that the resulting simple algorithm satisfies strong computational and sample complexity guarantees, achieved under the least restrictive data coverage assumptions known in the literature. In particular, we show that the sample complexity of our method scales optimally with the desired accuracy level and depends on a weak notion of coverage that only requires the empirical feature covariance matrix to cover a single direction in the feature space (as opposed to covering a full subspace). Additionally, our method can be implemented efficiently without requiring any computational oracles, and requires no prior knowledge of the coverage ratio (or even an upper bound on it), which altogether make it the strongest known algorithm for this setting to date.
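To make the central object concrete, here is a minimal NumPy sketch (not the paper's algorithm) of the two quantities the abstract builds on: a linearly realizable reward r(s, a) = <phi(s, a), theta> and the discounted feature occupancy lambda(pi) = E[sum_t gamma^t phi(s_t, a_t)], estimated by Monte Carlo rollouts in a small synthetic MDP, so that a policy's return is the inner product <lambda(pi), theta>. The MDP sizes and the uniform policy are illustrative choices.

```python
# Toy illustration of "feature occupancies" in a linearly realizable MDP.
# Not the authors' gradient-ascent algorithm; only the quantities it works with.
import numpy as np

rng = np.random.default_rng(0)
S, A, d, gamma = 5, 3, 4, 0.9

phi = rng.normal(size=(S, A, d))            # known feature map phi(s, a)
theta = rng.normal(size=d)                   # unknown reward parameter, r(s, a) = <phi(s, a), theta>
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P(. | s, a)

def feature_occupancy(policy, n_rollouts=2000, horizon=200):
    """Monte Carlo estimate of lambda(pi) = E[sum_t gamma^t phi(s_t, a_t)]."""
    lam = np.zeros(d)
    for _ in range(n_rollouts):
        s = rng.integers(S)                  # uniform initial state (illustrative choice)
        disc = 1.0
        for _ in range(horizon):
            a = rng.choice(A, p=policy[s])
            lam += disc * phi[s, a]
            s = rng.choice(S, p=P[s, a])
            disc *= gamma
    return lam / n_rollouts

uniform = np.full((S, A), 1.0 / A)
lam_hat = feature_occupancy(uniform)
print("estimated return <lambda(pi), theta>:", lam_hat @ theta)
```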
Speaker Bio: Nneka is a Norbert Wiener Postdoctoral Fellow at the Institute for Data, Systems, and Society (IDSS) at MIT. Prior to this, she completed her Ph.D. in computer science at Pompeu Fabra University in Barcelona, and she holds a Bachelor’s degree in Applied Mathematics from the University of Benin in Nigeria. Her current research lies at the intersection of reinforcement learning, stochastic optimization, and online learning. Toward applying RL to real-world sequential decision-making, she is particularly interested in developing algorithms for RL in large MDPs (where the number of states is finite but intractably large) with provable time- and sample-complexity guarantees.
Speaker: David Janz (University of Oxford)
Title: When and why randomised exploration works (in linear bandits)
Paper: https://arxiv.org/abs/2502.08870
Slides: here
More Details:
When and why randomised exploration works (in linear bandits)
Authors: Marc Abeille, David Janz, Ciara Pike-Burke
Abstract: We provide an approach for the analysis of randomised exploration algorithms like Thompson sampling that does not rely on forced optimism or posterior inflation. With this, we demonstrate that in the d-dimensional linear bandit setting, when the action space is smooth and strongly convex, randomised exploration algorithms enjoy an n-step regret bound of the order $O(d\sqrt{n}\log{n})$. Notably, this shows for the first time that there exist non-trivial linear bandit settings where Thompson sampling can achieve optimal dimension dependence in the regret.
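For intuition, below is a small, hedged sketch of linear Thompson sampling on the Euclidean unit ball, one concrete example of a smooth and strongly convex action set. The prior, noise scale, and posterior-sampling details here are illustrative choices, not the exact construction analysed in the paper; note there is no forced optimism or posterior inflation.

```python
# Toy linear Thompson sampling on the unit-ball action set (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
d, n, lam, sigma = 5, 2000, 1.0, 0.1
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)      # unknown parameter on the unit sphere

V = lam * np.eye(d)                            # regularized design matrix
b = np.zeros(d)                                # sum of a_t * reward_t
regret = 0.0

for t in range(n):
    theta_hat = np.linalg.solve(V, b)          # ridge estimate
    # Sample from the Gaussian surrogate posterior N(theta_hat, sigma^2 V^{-1}).
    theta_tilde = rng.multivariate_normal(theta_hat, sigma**2 * np.linalg.inv(V))
    # On the unit ball, argmax_{||a|| <= 1} <a, theta_tilde> is theta_tilde normalized.
    a = theta_tilde / max(np.linalg.norm(theta_tilde), 1e-12)
    reward = a @ theta_star + sigma * rng.normal()
    V += np.outer(a, a)
    b += a * reward
    regret += 1.0 - a @ theta_star             # optimal action is theta_star itself

print(f"cumulative regret after {n} rounds: {regret:.2f}")
```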
Speaker Bio: I am a Research Fellow at the University of Oxford. Previously, I was fortunate to be a postdoctoral researcher with Csaba Szepesvári at the University of Alberta, working on sequential decision making. I completed my PhD in applied machine learning at the University of Cambridge, supervised by Zoubin Ghahramani and José Miguel Hernández-Lobato. My research focuses on designing and analysing simple, practical learning and decision-making algorithms with strong theoretical foundations, as well as on the theory underpinning machine learning and real-world experimental methodology.
If you are considering doing a PhD and are interested in the theory of machine learning, please do get in touch. Oxford has quite a few funded positions, but doesn't get too many strong applicants in this area.
Speaker: Dhruv Rohatgi (MIT)
Title: Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification
Paper: https://arxiv.org/abs/2502.12465
Slides: here
More Details:
Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification
Authors: Dhruv Rohatgi, Adam Block, Audrey Huang, Akshay Krishnamurthy, Dylan J. Foster
Abstract: Error amplification is a widely-observed empirical phenomenon in methods for offline imitation learning (e.g., behavior cloning) and autoregressive language modeling (e.g., next-token prediction with log-loss) -- where errors in the learned policy/model compound as the horizon H increases. But what are the fundamental sources of error amplification? Is it the result of an inherent statistical or computational barrier, or can it be avoided with more clever algorithm design?
A growing body of empirical work hypothesizes that a fundamental source is misspecification, where the learner is not sufficiently expressive to represent the target distribution. In this talk we discuss a unified theoretical formalism for misspecified imitation learning and sequence modeling, where next-token prediction is a special case of behavior cloning. We verify that the standard methods incur error amplification due to misspecification, and ask to what extent this mode of error amplification can be avoided. We uncover techniques for (partially) mitigating error amplification, but also inherent algorithmic and computational barriers:
(1) Information-theoretically, one can avoid error amplification entirely, but the algorithm is impractical.
(2) Behavior cloning can be made more robust by practical modifications, but there is an inherent barrier to further improvements: any next-token prediction-style objective incurs "linear (in H)" error amplification.
(3) For the natural testbed of autoregressive (softmax) linear models, no computationally efficient algorithm can achieve "sub-polynomial" error amplification; however, at least for binary token spaces, one can smoothly trade compute for statistical power and surpass the "linear" barrier in sub-exponential time.
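As a back-of-the-envelope companion to point (2), the classic compounding-error calculation below (not the paper's formal argument) shows where "linear in H" comes from: if the learned policy deviates from the expert with probability eps at each step, and any deviation takes it off the expert's state distribution for good, then the chance of deviating somewhere within horizon H is 1 - (1 - eps)^H, roughly min(1, H * eps).

```python
# Classic compounding-error arithmetic for behavior cloning (illustrative only).
eps = 0.01   # per-step probability of deviating from the expert

for H in (1, 10, 50, 100, 200):
    p_off = 1 - (1 - eps) ** H   # probability of at least one deviation in H steps
    print(f"H = {H:4d}   P(trajectory leaves expert distribution) = {p_off:.3f}   H*eps = {H * eps:.2f}")
```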
Speaker Bio: Dhruv Rohatgi is a PhD Student at the Massachusetts Institute of Technology. His research interests are in theoretical machine learning, with particular focus on understanding fundamental statistical problems from a computational perspective. He is fortunate to be advised by Ankur Moitra, and was supported by an NDSEG Graduate Fellowship and an MIT Akamai Fellowship.
Speakers: Sikata Sengupta & Marcel Hussing (University of Pennsylvania)
Title: Oracle-Efficient Reinforcement Learning for Max Value Ensembles
Paper: https://arxiv.org/abs/2405.16739
Slides: here
More Details:
Oracle-Efficient Reinforcement Learning for Max Value Ensembles
Authors: Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell
Abstract: Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both theoretically (where worst-case sample and computational complexities must scale with state space cardinality) and experimentally (where function approximation and policy gradient techniques often scale poorly and suffer from instability and high variance). One line of research attempting to address these difficulties makes the natural assumption that we are given a collection of base or constituent policies (possibly heuristic) upon which we would like to improve in a scalable manner. In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (but not their value functions). In contrast to prior work in similar settings, our theoretical results require only the minimal assumption of an ERM oracle for value function approximation for the constituent policies (and not the global optimal policy or the max-following policy itself) on samplable distributions. We illustrate our algorithm's experimental effectiveness and behavior on several robotic simulation testbeds.
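For concreteness, here is a hedged sketch of the max-following benchmark itself: at each state, act as whichever constituent policy currently has the highest value. The values here are estimated by Monte Carlo rollouts from a hypothetical simulator exposing `restore`/`step` hooks; the paper's oracle-efficient algorithm avoids exactly this requirement, needing only an ERM oracle for value-function approximation on samplable distributions.

```python
# Sketch of the max-following benchmark policy (hypothetical env interface).
import numpy as np

def mc_value(env, policy, state, gamma=0.99, horizon=200, n_rollouts=16):
    """Monte Carlo estimate of `policy`'s discounted value starting from `state`."""
    returns = []
    for _ in range(n_rollouts):
        env.restore(state)                    # hypothetical reset-to-state hook
        s, total, disc = state, 0.0, 1.0
        for _ in range(horizon):
            s, r, done = env.step(policy(s))  # hypothetical step(action) -> (state, reward, done)
            total += disc * r
            disc *= gamma
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))

def max_following_action(env, policies, state, **kwargs):
    """Act as the constituent policy with the highest estimated value at `state`."""
    values = [mc_value(env, pi, state, **kwargs) for pi in policies]
    return policies[int(np.argmax(values))](state)
```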
Speaker Bio: Sikata Sengupta is a CS PhD student at the University of Pennsylvania. She is fortunate to be co-advised by Michael Kearns, Aaron Roth, and Duncan Watts. She is broadly interested in the intersection of Machine Learning Theory, Algorithmic Game Theory, and Reinforcement Learning/Control. Previously, she received her B.A.S. from Stanford University.
Marcel Hussing is a 5th year PhD student at the University of Pennsylvania, advised by Prof. Eric Eaton. The focus of his PhD research is the intersection of reinforcement learning and algorithmic stability, with a particular emphasis on stabilizing function approximation, dealing with stochastic initial conditions and multi-modal rewards, and achieving formal replicability to enable consistent and optimal reinforcement learning in high-dimensional spaces. While his primary focus is reinforcement learning, Marcel has also worked on a broad range of machine learning applications, including solutions for environmental observation, chemistry, and robotics. Most recently, he completed an internship at Microsoft Research, where he developed novel planning algorithms for transformers.
Speaker: Jeongyeol Kwon (University of Wisconsin-Madison)
Title: RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation
Paper: https://arxiv.org/abs/2406.01389
Slides: here
More Details:
RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation
Authors: Jeongyeol Kwon, Shie Mannor, Constantine Caramanis, Yonathan Efroni
Abstract: Real-world decision-making often requires adapting to partial and nuanced contexts — yet theoretical foundations for building such adaptive systems remain limited. In this talk, I will present recent work on reinforcement learning in partially observable environments, introducing the first sample-efficient algorithms for a broad class of Latent Markov Decision Processes. Our approach integrates core principles from statistical and offline learning — such as moment-matching, off-policy evaluation, and distribution learning — under structural assumptions that admit tractable exploration without full observability. Specifically, we propose new concepts including policy-segmentation, correlation-coverage, and the use of off-policy evaluation to derive online guarantees, offering new insight into how agents can adapt and generalize effectively without full information. I will conclude by exploring the near-term potential of these ideas towards large-scale systems, both in theory and practice.
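As background for the off-policy-evaluation ingredient mentioned above, here is a minimal sketch of standard trajectory-level importance sampling (a generic OPE estimator, not the paper's construction): returns logged under a behaviour policy are re-weighted by the target-to-behaviour likelihood ratio to estimate the target policy's value without running it.

```python
# Generic importance-sampling off-policy evaluation (illustrative only).
import numpy as np

def importance_sampling_ope(trajectories, target_policy, behaviour_policy, gamma=0.99):
    """trajectories: list of [(state, action, reward), ...] logged under behaviour_policy.
    target_policy(a, s) and behaviour_policy(a, s) return action probabilities."""
    estimates = []
    for traj in trajectories:
        weight, ret, disc = 1.0, 0.0, 1.0
        for s, a, r in traj:
            weight *= target_policy(a, s) / behaviour_policy(a, s)
            ret += disc * r
            disc *= gamma
        estimates.append(weight * ret)        # trajectory-level importance weight
    return float(np.mean(estimates))
```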
Speaker Bio: Jeongyeol Kwon is a postdoctoral researcher at the Wisconsin Institute for Discovery, University of Wisconsin–Madison. His research interests lie in reinforcement learning, optimization, and their applications to adaptive decision-making systems. He received his Ph.D. in Electrical and Computer Engineering from the University of Texas at Austin, where he was supported by the Graduate Continuing Fellowship and the Kwanjeong Foundation Scholarship for Graduate Studies.
Speaker: Max Simchowitz (Carnegie Mellon)
Title: The Pitfalls of Imitation Learning when Actions are Continuous
Paper: https://arxiv.org/abs/2503.09722
Slides: here
More Details:
The Pitfalls of Imitation Learning when Actions are Continuous
Authors: Max Simchowitz, Daniel Pfrommer, Ali Jadbabaie
Abstract: We study the problem of imitating an expert demonstrator in a discrete-time, continuous state-and-action control system. We show that, even if the dynamics satisfy a control-theoretic property called exponential stability (i.e. the effects of perturbations decay exponentially quickly), and the expert is smooth and deterministic, any smooth, deterministic imitator policy necessarily suffers error on execution that is exponentially larger, as a function of problem horizon, than the error under the distribution of expert training data. Our negative result applies to any algorithm which learns solely from expert data, including both behavior cloning and offline-RL algorithms, unless the algorithm produces highly "improper" imitator policies--those which are non-smooth, non-Markovian, or which exhibit highly state-dependent stochasticity--or unless the expert trajectory distribution is sufficiently "spread." We provide experimental evidence of the benefits of these more complex policy parameterizations, explicating the benefits of today's popular policy parameterizations in robot learning (e.g. action-chunking and Diffusion Policies). We also establish a host of complementary negative and positive results for imitation in control systems.
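To illustrate one of the "improper" parameterizations the abstract credits, here is a hedged sketch of action chunking: rather than a Markovian map from state to a single action, the policy predicts a block of k actions from the current observation and executes them open-loop before re-querying. The `env` and `chunk_policy` interfaces are hypothetical placeholders, not the paper's experimental setup.

```python
# Sketch of an action-chunked rollout (hypothetical interfaces, illustrative only).
def chunked_rollout(env, chunk_policy, horizon=200, chunk_size=8):
    """chunk_policy(obs) -> sequence of chunk_size actions."""
    obs = env.reset()
    total_reward, t = 0.0, 0
    while t < horizon:
        chunk = chunk_policy(obs)                     # predict k actions at once
        for a in chunk[:min(chunk_size, horizon - t)]:
            obs, reward, done = env.step(a)           # execute open-loop within the chunk
            total_reward += reward
            t += 1
            if done:
                return total_reward
    return total_reward
```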
Speaker Bio: Max Simchowitz's current interests span the gamut from the mathematical to the practical, ranging broadly across topics in adaptive sampling, multi-armed bandits, the complexity of convex and non-convex optimization, reinforcement learning, learning in linear and nonlinear dynamical systems, and fairness in machine learning.