Future Seminars

Tuesday 30th APR 2024, 5PM UTC


Speaker: Sergey Samsonov (HSE University)

Title: Finite-Sample Analysis of the Temporal Difference Learning

Paper: https://arxiv.org/abs/2310.14286


More details:

Authors: Sergey Samsonov, Daniil Tiapkin, Alexey Naumov, Eric Moulines

Abstract: In this paper we consider the problem of obtaining sharp bounds for the performance of temporal difference (TD) methods with linear functional approximation for policy evaluation in discounted Markov Decision Processes. We show that a simple algorithm with a universal and instance-independent step size together with Polyak-Ruppert tail averaging is sufficient to obtain near-optimal variance and bias terms. We also provide the respective sample complexity bounds. Our proof technique is based on refined error bounds for linear stochastic approximation together with a novel stability result for products of random matrices that arise from the TD-type recurrence.
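
As background for the discussion, the sketch below shows TD(0) with linear function approximation, a constant step size, and Polyak-Ruppert tail averaging, which is the kind of simple scheme the abstract refers to. It is not the authors' implementation; the environment interface, the feature map `phi`, and the tail fraction are assumptions made for the example.

```python
import numpy as np

def td0_tail_averaged(env, phi, num_steps, alpha=0.01, gamma=0.99, tail_frac=0.5):
    """TD(0) with linear function approximation, a constant (instance-independent)
    step size, and Polyak-Ruppert tail averaging over the last `tail_frac` of
    the iterates. Illustrative sketch only.

    Assumed interface (not from the paper): `env.reset()` returns a state,
    `env.step(state)` samples (reward, next_state) under the evaluated policy,
    and `phi(state)` returns a d-dimensional NumPy feature vector.
    """
    state = env.reset()
    theta = np.zeros(phi(state).shape, dtype=float)
    iterates = []

    for _ in range(num_steps):
        reward, next_state = env.step(state)
        # Semi-gradient TD(0) update on the value estimate V(s) = phi(s) @ theta.
        td_error = reward + gamma * phi(next_state) @ theta - phi(state) @ theta
        theta = theta + alpha * td_error * phi(state)
        iterates.append(theta.copy())
        state = next_state

    # Polyak-Ruppert tail averaging: drop the burn-in, average the remaining iterates.
    burn_in = int((1.0 - tail_frac) * num_steps)
    return np.mean(iterates[burn_in:], axis=0)
```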

Speaker Bio: Sergey Samsonov is a PhD candidate in mathematics and a Research Fellow at the International Laboratory of Stochastic Algorithms and High-Dimensional Inference of the National Research University Higher School of Economics. He obtained his bachelor's degree from Lomonosov Moscow State University (MSU) in 2017 and a Master's degree in applied mathematics from HSE and Skoltech in 2019. Sergey's research interests mainly revolve around sampling methods, reinforcement learning, and stochastic approximation. His primary research focus lies in the development and theoretical analysis of Markov Chain Monte Carlo (MCMC) algorithms and stochastic approximation problems with applications in RL.

Sergey Samsonov has authored 14 papers in leading journals and conferences in the field, including Statistics and Computing, Journal of Machine Learning Research, Mathematics of Operations Research, COLT, ICML, and NeurIPS.

Tuesday 7th MAY 2024, 5PM UTC


Speaker: Gene Li (Toyota Technological Institute at Chicago)

Title: When is Agnostic Reinforcement Learning Statistically Tractable?

Paper: https://openreview.net/forum?id=6gWpJ0IExE


More details:

Authors: Zeyu Jia, Gene Li, Alexander Rakhlin, Ayush Sekhari, Nathan Srebro

Abstract: We study the problem of agnostic PAC reinforcement learning (RL): given a policy class Π, how many rounds of interaction with an unknown MDP (with a potentially large state and action space) are required to learn an ε-suboptimal policy with respect to Π? Towards that end, we introduce a new complexity measure, called the spanning capacity, that depends solely on the set Π and is independent of the MDP dynamics. With a generative model, we show that the spanning capacity characterizes PAC learnability for every policy class Π. However, for online RL, the situation is more subtle. We show there exists a policy class with a bounded spanning capacity that requires a superpolynomial number of samples to learn. This reveals a surprising separation for agnostic learnability between generative access and online access models (as well as between deterministic/stochastic MDPs under online access). On the positive side, we identify an additional sunflower structure which, in conjunction with bounded spanning capacity, enables statistically efficient online RL via a new algorithm called POPLER, which takes inspiration from classical importance sampling methods as well as recent developments for reachable-state identification and policy evaluation in reward-free exploration.
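
The abstract notes that POPLER takes inspiration from classical importance sampling. As background only, here is a minimal sketch of trajectory-level importance sampling for evaluating every policy in a finite class from data collected by a single behavior policy; the data format and the `prob` interface are assumptions for the example, and this is not the POPLER algorithm.

```python
import numpy as np

def evaluate_policy_class(trajectories, policy_class, behavior_policy):
    """Off-policy evaluation of each policy in a finite class via trajectory-level
    importance sampling (the classical idea the abstract refers to; not POPLER).

    Assumed format: each trajectory is a list of (state, action, reward) tuples
    collected by `behavior_policy`, and every policy exposes `prob(state, action)`.
    """
    estimates = []
    for pi in policy_class:
        values = []
        for traj in trajectories:
            weight, episode_return = 1.0, 0.0
            for state, action, reward in traj:
                # Reweight the trajectory by how likely pi is to have produced it
                # relative to the behavior policy.
                weight *= pi.prob(state, action) / behavior_policy.prob(state, action)
                episode_return += reward
            values.append(weight * episode_return)
        estimates.append(float(np.mean(values)))
    return estimates  # one value estimate per policy in policy_class
```

Selecting a policy then amounts to an argmax over these estimates; the paper's contribution is characterizing when such agnostic learning is statistically tractable at all under the different access models.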

Speaker Bio: Gene Li is a PhD student at Toyota Technological Institute at Chicago, advised by Nathan Srebro. His research focuses on machine learning theory and decision making. Prior to graduate school, he received his bachelor’s degree in Electrical Engineering from Princeton University.

Tuesday 14th MAY 2024, 5PM UTC


Speaker: Uri Sherman (Tel-Aviv University)

Title: Rate-Optimal Policy Optimization for Linear Markov Decision Processes

Paper: https://arxiv.org/abs/2308.14642

More details:

Authors: Uri Sherman, Alon Cohen, Tomer Koren, Yishay Mansour

Abstract: We study regret minimization in online episodic linear Markov Decision Processes, and obtain rate-optimal Õ(√K) regret where K denotes the number of episodes. Our work is the first to establish the optimal (w.r.t. K) rate of convergence in the stochastic setting with bandit feedback using a policy optimization based approach, and the first to establish the optimal (w.r.t. K) rate in the adversarial setup with full information feedback, for which no algorithm with an optimal rate guarantee is currently known.
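
For context, policy optimization methods of this kind are built around an exponential-weights (soft policy iteration) update of the policy against estimated action values. Below is a minimal tabular sketch of that generic update step; the step size and the source of the Q-estimates are assumptions made for the example, and this is not the paper's algorithm for linear MDPs.

```python
import numpy as np

def exponential_weights_policy_update(policy, q_estimate, eta=0.1):
    """One mirror-descent / soft policy iteration step, the generic building block
    behind policy optimization methods. Tabular sketch only; the paper works with
    linear MDPs and carefully constructed value estimates.

    policy:     array of shape (num_states, num_actions), rows sum to 1.
    q_estimate: array of shape (num_states, num_actions) of estimated action values.
    """
    # Multiplicative update: pi_new(a|s) is proportional to pi(a|s) * exp(eta * Q(s, a)).
    logits = np.log(policy + 1e-12) + eta * q_estimate
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    new_policy = np.exp(logits)
    new_policy /= new_policy.sum(axis=1, keepdims=True)
    return new_policy
```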

Speaker Bio: Uri is a fourth-year PhD student at Tel-Aviv University, advised by Yishay Mansour and Tomer Koren. Prior to that, Uri spent a few years in various engineering and management positions in the private sector, and before that obtained his B.Sc. from Tel Aviv University and M.Sc. from the Weizmann Institute of Science, where he worked with Prof. Uriel Feige. Uri's research interests are in stochastic optimization and reinforcement learning theory.

Tuesday 21st MAY 2024, 5PM UTC


Speaker: Ruiqi Zhang (UC Berkeley)

Title: Policy Finetuning in Reinforcement Learning via Design of Experiments using Offline Data

Paper: https://arxiv.org/abs/2307.04354

More details:

Authors: Ruiqi Zhang, Andrea Zanette

Abstract: In some applications of reinforcement learning, a dataset of pre-collected experience is already available but it is also possible to acquire some additional online data to help improve the quality of the policy. However, it may be preferable to gather additional data with a single, non-reactive exploration policy and avoid the engineering costs associated with switching policies. In this paper we propose an algorithm with provable guarantees that can leverage an offline dataset to design a single non-reactive policy for exploration. We theoretically analyze the algorithm and measure the quality of the final policy as a function of the local coverage of the original dataset and the amount of additional data collected. 
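
To make the idea of a single, non-reactive exploration policy concrete, here is an illustrative sketch that counts state-action coverage in the offline dataset and commits in advance to one exploration policy that upweights poorly covered actions. The counting scheme and the temperature parameter are assumptions made for the example; the paper's actual design-of-experiments procedure is more refined.

```python
import numpy as np

def nonreactive_exploration_policy(offline_dataset, num_states, num_actions, temperature=1.0):
    """Build one exploration policy from offline data and commit to it up front
    (non-reactive: it is never updated while the additional data is collected).

    Illustrative only, not the paper's procedure. Assumed dataset format: an
    iterable of (state, action, reward, next_state) tuples with integer indices.
    """
    counts = np.zeros((num_states, num_actions))
    for state, action, _reward, _next_state in offline_dataset:
        counts[state, action] += 1

    # Favor actions the offline dataset covers poorly.
    scores = -temperature * np.log1p(counts)
    scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
    policy = np.exp(scores)
    policy /= policy.sum(axis=1, keepdims=True)
    return policy  # shape (num_states, num_actions); each row is a distribution over actions
```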

Speaker Bio: Ruiqi prefers to maintain an air of mystery.