Past Seminars - Spring 2024

Tuesday 7th May 2024


Speaker: Gene Li (Toyota Technological Institute at Chicago)

Title: When is Agnostic Reinforcement Learning Statistically Tractable?

Paper: https://arxiv.org/abs/2310.06113

Slides: here


More details:

Authors: Zeyu Jia, Gene Li, Alexander Rakhlin, Ayush Sekhari, Nathan Srebro

Abstract: We study the problem of agnostic PAC reinforcement learning (RL): given a policy class Π, how many rounds of interaction with an unknown MDP (with a potentially large state and action space) are required to learn an ε-suboptimal policy with respect to Π? Towards that end, we introduce a new complexity measure, called the spanning capacity, that depends solely on the set Π and is independent of the MDP dynamics. With a generative model, we show that the spanning capacity characterizes PAC learnability for every policy class Π. However, for online RL, the situation is more subtle. We show there exists a policy class Π with a bounded spanning capacity that requires a superpolynomial number of samples to learn. This reveals a surprising separation for agnostic learnability between generative access and online access models (as well as between deterministic/stochastic MDPs under online access). On the positive side, we identify an additional sunflower structure which, in conjunction with bounded spanning capacity, enables statistically efficient online RL via a new algorithm called POPLER, which takes inspiration from classical importance sampling methods as well as recent developments for reachable-state identification and policy evaluation in reward-free exploration.

Speaker Bio: Gene Li is a PhD student at Toyota Technological Institute at Chicago, advised by Nathan Srebro. His research focuses on machine learning theory and decision making. Prior to graduate school, he received his bachelor’s degree in Electrical Engineering from Princeton University.

Tuesday 30th April 2024


Speaker: Sergey Samsonov (HSE University)

Title: Finite-Sample Analysis of the Temporal Difference Learning

Paper: https://arxiv.org/abs/2310.14286

Slides: here


More details:

Authors: Sergey Samsonov, Daniil Tiapkin, Alexey Naumov, Eric Moulines

Abstract: In this paper we consider the problem of obtaining sharp bounds for the performance of temporal difference (TD) methods with linear function approximation for policy evaluation in discounted Markov Decision Processes. We show that a simple algorithm with a universal and instance-independent step size together with Polyak-Ruppert tail averaging is sufficient to obtain near-optimal variance and bias terms. We also provide the respective sample complexity bounds. Our proof technique is based on refined error bounds for linear stochastic approximation together with a novel stability result for the product of random matrices that arise from the TD-type recurrence.
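
As a rough illustration of the algorithm analysed here, the sketch below runs TD(0) with linear function approximation, a constant instance-independent step size, and Polyak-Ruppert tail averaging of the iterates. It is only a schematic rendering under assumed interfaces (env, policy, phi) and placeholder constants, not the paper's exact procedure.

import numpy as np

def td0_tail_averaged(env, policy, phi, dim, gamma=0.99,
                      alpha=0.01, n_steps=100_000, burn_in_frac=0.5):
    """TD(0) with linear function approximation, a constant step size,
    and Polyak-Ruppert tail averaging (sketch; env/policy/phi are assumed APIs)."""
    theta = np.zeros(dim)        # current iterate
    theta_bar = np.zeros(dim)    # running tail average
    n_avg = 0
    burn_in = int(burn_in_frac * n_steps)

    s = env.reset()
    for t in range(n_steps):
        a = policy(s)
        s_next, r, done = env.step(a)    # assumed to return (state, reward, done)
        # TD(0) update with a universal constant step size alpha
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta = theta + alpha * td_error * phi(s)
        # tail averaging: only iterates after the burn-in enter the average
        if t >= burn_in:
            n_avg += 1
            theta_bar += (theta - theta_bar) / n_avg
        s = env.reset() if done else s_next
    return theta_bar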

Speaker Bio: Sergey Samsonov is a PhD candidate in mathematics and a Research Fellow at the International Laboratory of Stochastic Algorithms and High-Dimensional Inference of the National Research University Higher School of Economics. He obtained his bachelor's degree from Lomonosov Moscow State University (MSU) in 2017 and a Master's degree in applied mathematics from HSE and Skoltech in 2019. Sergey's research interests mainly revolve around sampling methods, reinforcement learning, and stochastic approximation. His primary research focus lies in the development and theoretical analysis of Markov Chain Monte Carlo (MCMC) algorithms and stochastic approximation problems with applications in RL.

Sergey Samsonov has authored 14 papers in leading journals and conferences in the field, including Statistics and Computing, Journal of Machine Learning Research, Mathematics of Operations Research, COLT, ICML, and NeurIPS.

Tuesday 23rd April 2024


Speaker: Hamish Flynn (Universitat Pompeu Fabra)

Title: Improved Algorithms for Stochastic Linear Bandits Using Tail Bounds for Martingale Mixtures

Paper: https://arxiv.org/abs/2309.14298

Slides: here


More details:

Authors: Hamish Flynn, David Reeb, Melih Kandemir, Jan Peters

Abstract: We present improved algorithms with worst-case regret guarantees for the stochastic linear bandit problem. The widely used "optimism in the face of uncertainty" principle reduces a stochastic bandit problem to the construction of a confidence sequence for the unknown reward function. The performance of the resulting bandit algorithm depends on the size of the confidence sequence, with smaller confidence sets yielding better empirical performance and stronger regret guarantees. In this work, we use a novel tail bound for adaptive martingale mixtures to construct confidence sequences which are suitable for stochastic bandits. These confidence sequences allow for efficient action selection via convex programming. We prove that a linear bandit algorithm based on our confidence sequences is guaranteed to achieve competitive worst-case regret. We show that our confidence sequences are tighter than competitors, both empirically and theoretically. Finally, we demonstrate that our tighter confidence sequences give improved performance in several hyperparameter tuning tasks. 
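
For orientation, here is a generic optimism-in-the-face-of-uncertainty loop for stochastic linear bandits built around an ellipsoidal confidence set. It uses the standard ridge-regression construction with a fixed confidence radius beta as a stand-in for the paper's martingale-mixture confidence sequences, and it maximises over a finite action set rather than solving the convex program used for continuous action sets.

import numpy as np

def optimistic_linear_bandit(actions, pull, T, dim, lam=1.0, beta=2.0):
    """Generic optimistic linear bandit loop (sketch). `actions` is a list of
    feature vectors, `pull(x)` returns a noisy reward (assumed callback), and
    `beta` is a placeholder confidence radius."""
    V = lam * np.eye(dim)            # regularised design matrix
    b = np.zeros(dim)                # sum of x_t * r_t
    for t in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b        # ridge-regression estimate
        # optimistic index: <x, theta_hat> + beta * ||x||_{V^{-1}}
        ucb = [x @ theta_hat + beta * np.sqrt(x @ V_inv @ x) for x in actions]
        x_t = actions[int(np.argmax(ucb))]
        r_t = pull(x_t)
        V += np.outer(x_t, x_t)      # rank-one update of the design matrix
        b += r_t * x_t
    return theta_hat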

Speaker Bio: Hamish Flynn is a postdoc at Universitat Pompeu Fabra. He is interested in various topics within machine learning theory, such as PAC-Bayes bounds, confidence sequences and bandits. Hamish obtained his PhD from TU Darmstadt, while working at the Bosch Center for Artificial Intelligence in Renningen.

Tuesday 16th April 2024


Speaker: Andrew Wagenmaker (University of Washington)

Title: Leveraging Offline Data in Online Reinforcement Learning

Paper: https://arxiv.org/abs/2211.04974

Slides: here


More details:

Authors: Andrew Wagenmaker, Aldo Pacchiano



Tuesday 9th April 2024


Speaker: Ayush Sekhari (MIT)

Title: Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees

Paper: https://arxiv.org/abs/2311.08384

Slides: here


More details:

Authors: Yifei Zhou, Ayush Sekhari, Yuda Song, Wen Sun

Abstract: Hybrid RL is the setting where an RL agent has access to both offline data and online data by interacting with the real-world environment. In this work, we propose a new hybrid RL algorithm that combines an on-policy actor-critic method with offline data. On-policy methods such as policy gradient and natural policy gradient (NPG) have been shown to be more robust to model misspecification, though they are sometimes not as sample efficient as methods that rely on off-policy learning. On the other hand, offline methods that depend on off-policy training often require strong assumptions in theory and are less stable to train in practice. Our new approach integrates a procedure of off-policy training on the offline data into an on-policy NPG framework. We show that our approach, in theory, can obtain a best-of-both-worlds type of result: it achieves the state-of-the-art theoretical guarantees of offline RL when offline RL-specific assumptions hold, while at the same time maintaining the theoretical guarantees of on-policy NPG regardless of the offline RL assumptions' validity. Experimentally, in challenging rich-observation environments, we show that our approach outperforms a state-of-the-art hybrid RL baseline that relies only on off-policy policy optimization, demonstrating the empirical benefit of combining on-policy and off-policy learning.
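
A very schematic version of the hybrid loop described above, assuming a tabular softmax policy: the critic is fit off-policy on the offline data pooled with freshly collected on-policy rollouts, and the actor takes a natural-policy-gradient (exponentiated-advantage) step. The interfaces collect_rollouts and fit_q are placeholders, not the paper's implementation.

import numpy as np

def hybrid_npg(offline_data, collect_rollouts, fit_q, n_states, n_actions,
               n_iters=100, eta=0.1):
    """Sketch of an offline-data-enhanced NPG loop with a tabular softmax policy.
    `offline_data`, `collect_rollouts`, and `fit_q` are assumed interfaces."""
    logits = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        policy = np.exp(logits - logits.max(axis=1, keepdims=True))
        policy /= policy.sum(axis=1, keepdims=True)
        # critic: off-policy fit on offline data pooled with on-policy rollouts
        online_data = collect_rollouts(policy)
        q_hat = fit_q(offline_data + online_data, policy)   # assumed (S, A) array
        # actor: NPG / soft policy iteration step using the estimated advantage
        v_hat = (policy * q_hat).sum(axis=1, keepdims=True)
        logits += eta * (q_hat - v_hat)
    return logits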

Speaker Bio: Ayush is a postdoctoral researcher working with Prof. Sasha Rakhlin at MIT. He completed his Ph.D. in the Computer Science department at Cornell University, advised by Professor Karthik Sridharan and Professor Robert D. Kleinberg. His research interests span interactive learning, stochastic optimization, and machine unlearning. Before his Ph.D., he spent a year at Google as part of the inaugural Brain residency program. Before Google, he completed his undergraduate studies in computer science at IIT Kanpur in India, where he was awarded the President's Gold Medal. He also won a student best paper award at COLT 2019.

Tuesday 2nd April 2024


Speaker: Matthew Zurek (University of Wisconsin-Madison)

Title: Span-Based Optimal Sample Complexity for Average Reward MDPs

Paper: https://arxiv.org/abs/2311.13469

Slides: here


More details:

Authors: Matthew Zurek, Yudong Chen

Abstract: We study the sample complexity of learning an ε-optimal policy in an average-reward Markov decision process (MDP) under a generative model. We establish the complexity bound Õ(SAH/ε^2), where H is the span of the bias function of the optimal policy and SA is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters S, A, H and ε, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters.

Our result is based on reducing the average-reward MDP to a discounted MDP. To establish the optimality of this reduction, we develop improved bounds for γ-discounted MDPs, showing that Õ(SAH/((1−γ)^2 ε^2)) samples suffice to learn an ε-optimal policy in weakly communicating MDPs in the regime γ ≥ 1 − 1/H, circumventing the well-known lower bound of Ω̃(SA/((1−γ)^3 ε^2)) for general γ-discounted MDPs. Our analysis develops upper bounds on certain instance-dependent variance parameters in terms of the span parameter. These bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.
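
The reduction behind the result can be sketched as follows: draw samples from the generative model for every state-action pair, form the empirical MDP, and solve it as a discounted problem with the discount chosen so that the effective horizon 1/(1−γ) matches the span H. Sample sizes, tolerances, and the sampling interface below are illustrative placeholders.

import numpy as np

def average_reward_via_discounting(sample, n_states, n_actions, span_H,
                                   n_samples=1000, tol=1e-6):
    """Sketch: build an empirical MDP from generative-model samples and solve it
    as a discounted MDP with gamma = 1 - 1/H (`sample(s, a)` is an assumed
    generative model returning a next state and a reward)."""
    gamma = 1.0 - 1.0 / span_H
    P_hat = np.zeros((n_states, n_actions, n_states))
    R_hat = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                s_next, r = sample(s, a)
                P_hat[s, a, s_next] += 1.0 / n_samples
                R_hat[s, a] += r / n_samples
    # value iteration on the empirical discounted MDP
    V = np.zeros(n_states)
    while True:
        Q = R_hat + gamma * (P_hat @ V)      # (S, A) Bellman backup
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)          # greedy policy for the empirical MDP
        V = V_new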

Speaker Bio: Matthew is a PhD student in the Computer Sciences department at the University of Wisconsin-Madison, advised by Yudong Chen. His research interests span machine learning, optimization, and high-dimensional statistics, with a particular focus on the theory of reinforcement learning.

Tuesday 26th March 2024


Speaker: Roberto Cipollone (La Sapienza University of Rome)

Title: Provably Efficient Offline Reinforcement Learning in Regular Decision Processes

Paper: https://openreview.net/forum?id=8bQc7oRnjm

Slides: here


More details:

Authors: Roberto Cipollone, Anders Jonsson, Alessandro Ronca, Mohammad Sadegh Talebi 

Abstract: This paper deals with offline (or batch) Reinforcement Learning (RL) in episodic Regular Decision Processes (RDPs). RDPs are the subclass of Non-Markov Decision Processes where the dependency on the history of past events can be captured by a finite-state automaton. We consider a setting where the automaton that underlies the RDP is unknown, and a learner strives to learn a near-optimal policy using pre-collected data, in the form of non-Markov sequences of observations, without further exploration. We present RegORL, an algorithm that suitably combines automata learning techniques and state-of-the-art algorithms for offline RL in MDPs. RegORL has a modular design allowing one to use any off-the-shelf offline RL algorithm in MDPs. We report a non-asymptotic high-probability sample complexity bound for RegORL to yield an ε-optimal policy, which involves a notion of concentrability relevant for RDPs. Furthermore, we present a sample complexity lower bound for offline RL in RDPs. To the best of our knowledge, this is the first work presenting a provably efficient algorithm for offline learning in RDPs.
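
The modular design mentioned in the abstract can be pictured as a two-stage pipeline: learn a finite-state automaton from the non-Markov episodes, relabel each episode with the automaton state so that the data becomes Markovian, and then hand it to any off-the-shelf offline RL algorithm. The sketch below fixes only these interfaces; learn_automaton and offline_rl_algo are hypothetical stand-ins for the actual components of RegORL.

def rdp_offline_rl_pipeline(episodes, learn_automaton, offline_rl_algo):
    """Two-stage offline RL pipeline for RDPs (sketch with assumed interfaces).
    `episodes` is a list of (observation, action, reward) sequences."""
    # Stage 1: learn a finite-state automaton capturing the history dependence
    automaton = learn_automaton(episodes)

    # Stage 2: relabel episodes with automaton states to obtain a Markovian dataset
    mdp_dataset = []
    for episode in episodes:
        q = automaton.initial_state()
        for obs, action, reward in episode:
            q_next = automaton.step(q, obs, action, reward)
            mdp_dataset.append((q, action, reward, q_next))
            q = q_next

    # Stage 3: run any off-the-shelf offline RL algorithm on the relabelled data
    return offline_rl_algo(mdp_dataset)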

Speaker Bio: Roberto is a final-year PhD candidate in Engineering and Computer Science at La Sapienza University of Rome, advised by Giuseppe De Giacomo. His research mainly focuses on Reinforcement Learning in environments with non-Markovian dynamics and on Hierarchical RL in the presence of abstractions of decision processes.

Tuesday 19th March 2024


Speaker: Daniil Tiapkin (Ecole Polytechnique and Université Paris-Saclay)

Title: Model-free Posterior Sampling via Learning Rate Randomization

Paper: https://arxiv.org/abs/2310.18186

Slides: here 


More details:

Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Menard

Abstract: In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieves a regret bound of order Õ(√(H^5 SAT)), where H is the planning horizon, S is the number of states, A is the number of actions, and T is the number of episodes. For a metric state-action space, RandQL enjoys a regret bound of order Õ(H^(5/2) T^((d_z+1)/(d_z+2))), where d_z denotes the zooming dimension. Notably, RandQL achieves optimistic exploration without using bonuses, relying instead on a novel idea of learning rate randomization. Our empirical study shows that RandQL outperforms existing approaches on baseline exploration environments.
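
The central mechanism, learning rate randomization, is easy to illustrate in the tabular case: keep an ensemble of optimistically initialised Q-estimates, update each with its own randomly drawn (here Beta-distributed) step size instead of adding an exploration bonus, and act greedily with respect to an aggregate of the ensemble. This is a simplified rendering with placeholder choices, not the exact RandQL update.

import numpy as np

def randomized_q_learning(env, n_states, n_actions, horizon,
                          n_episodes=1000, n_ensemble=10, seed=0):
    """Tabular Q-learning with randomized learning rates (simplified sketch)."""
    rng = np.random.default_rng(seed)
    # optimistic initialisation of an ensemble of stage-dependent Q-functions
    Q = np.full((n_ensemble, horizon, n_states, n_actions), float(horizon))
    counts = np.zeros((horizon, n_states, n_actions))

    for _ in range(n_episodes):
        s = env.reset()
        for h in range(horizon):
            # act greedily w.r.t. an aggregate of the ensemble (here: the max)
            a = int(Q[:, h, s, :].max(axis=0).argmax())
            s_next, r, done = env.step(a)    # assumed to return (state, reward, done)
            counts[h, s, a] += 1
            n = counts[h, s, a]
            for j in range(n_ensemble):
                nxt = 0.0 if h == horizon - 1 else Q[j, h + 1, s_next, :].max()
                # a random step size replaces both the usual deterministic rate
                # and any explicit exploration bonus
                w = rng.beta(1.0, n)         # illustrative Beta(1, n) draw
                Q[j, h, s, a] = (1 - w) * Q[j, h, s, a] + w * (r + nxt)
            s = s_next
            if done:
                break
    return Q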

Speaker Bio: Daniil is a first-year PhD student at Ecole Polytechnique and Université Paris-Saclay, advised by Eric Moulines and Gilles Stoltz. Prior to that, Daniil got his BSc and MSc at HSE University, where he started working in collaboration with Michal Valko and Pierre Menard. His research interests are focused on reinforcement learning theory, in particular on randomized exploration, regularization, and the intersection of RL with sampling.

Tuesday 12th March 2024


Speaker: Daniil Tiapkin (Ecole Polytechnique and Université Paris-Saclay)

Title: From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses

Papers: https://arxiv.org/abs/2205.07704  & https://arxiv.org/abs/2209.14414

Slides: here


More details:

Authors: Daniil Tiapkin, Denis Belomestny, Eric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang, Michal Valko, Pierre Menard

Abstract: We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision processes: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a \(Q\)-value function posterior as an upper confidence bound on the optimal \(Q\)-value function. For Bayes-UCBVI, we prove a regret bound of order \(\tilde{O}\left(\sqrt{H^3SAT}\right)\), where \(H\) is the length of one episode, \(S\) is the number of states, \(A\) the number of actions, and \(T\) the number of episodes, which matches the lower bound of \(\Omega\left(\sqrt{H^3SAT}\right)\) up to poly-log terms in \(H, S, A, T\) for a large enough \(T\). To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon \(H\) (and \(S\)) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum that can be of independent interest.


As an additional application of this anti-concentration bound, we also propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL) that only needs a number of posterior samples logarithmic in \(H\), \(S\), \(A\), and \(T\) per state-action pair. Most importantly, this version of the posterior sampling algorithm enjoys the same \(\tilde{O}\left(\sqrt{H^3SAT}\right)\) regret guarantee, thereby answering the open problem raised by Agrawal and Jia (2017b) for the episodic setting.


We then explain how Bayes-UCBVI can be easily extended beyond the tabular setting, exhibiting a strong link between our algorithm and Bayesian bootstrap (Rubin, 1981).
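
To make the "quantile of a Q-value posterior as an upper confidence bound" idea concrete, the sketch below performs backward induction in a tabular, stage-dependent MDP, sampling next-state distributions from a Dirichlet posterior and taking an upper quantile of the induced Q-values. The prior, quantile level, and number of posterior samples are placeholder choices rather than the ones analysed for Bayes-UCBVI.

import numpy as np

def posterior_quantile_q(counts, rewards_hat, horizon, n_states, n_actions,
                         quantile=0.9, n_posterior=50, prior=1.0, seed=0):
    """Quantile-of-posterior upper bounds on Q (sketch, not the exact algorithm).
    counts[h, s, a, s'] are transition counts, rewards_hat[h, s, a] are
    empirical mean rewards."""
    rng = np.random.default_rng(seed)
    Q_ucb = np.zeros((horizon, n_states, n_actions))
    V_ucb = np.zeros((horizon + 1, n_states))     # value at stage H is zero
    for h in range(horizon - 1, -1, -1):
        for s in range(n_states):
            for a in range(n_actions):
                # Dirichlet posterior over the next-state distribution
                alpha = counts[h, s, a] + prior
                next_values = rng.dirichlet(alpha, size=n_posterior) @ V_ucb[h + 1]
                q_samples = rewards_hat[h, s, a] + next_values
                # an upper posterior quantile plays the role of a bonus-free UCB
                Q_ucb[h, s, a] = np.quantile(q_samples, quantile)
        V_ucb[h] = Q_ucb[h].max(axis=1)
    return Q_ucb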

Speaker Bio: Daniil is a first-year PhD student at Ecole Polytechnique and Université Paris-Saclay, advised by Eric Moulines and Gilles Stoltz. Prior to that, Daniil got his BSc and MSc at HSE University, where he started working in collaboration with Michal Valko and Pierre Menard. His research interests are focused on reinforcement learning theory, in particular on randomized exploration, regularization, and the intersection of RL with sampling.

Tuesday 5th March 2024


Speaker: Noah Golowich (MIT)

Title: Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles

Paper: https://arxiv.org/abs/2309.09457

Slides: here


More details:

Authors: Noah Golowich, Ankur Moitra, Dhruv Rohatgi

Abstract: The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner has access to a known feature map ϕ(x,a) that maps state-action pairs to d-dimensional vectors, and that the rewards and transitions are linear functions in this representation. But where do these features come from? In the absence of expert domain knowledge, a tempting strategy is to use the "kitchen sink" approach and hope that the true features are included in a much larger set of potential features. In this paper we revisit linear MDPs from the perspective of feature selection. In a k-sparse linear MDP, there is an unknown subset S ⊂ [d] of size k containing all the relevant features, and the goal is to learn a near-optimal policy in only poly(k, log d) interactions with the environment. Our main result is the first polynomial-time algorithm for this problem. In contrast, earlier works either made prohibitively strong assumptions that obviated the need for exploration, or required solving computationally intractable optimization problems.

Along the way we introduce the notion of an emulator: a succinct approximate representation of the transitions that suffices for computing certain Bellman backups. Since linear MDPs are a non-parametric model, it is not even obvious whether polynomial-sized emulators exist. We show that they do exist and can be computed efficiently via convex programming.

As a corollary of our main result, we give an algorithm for learning a near-optimal policy in block MDPs whose decoding function is a low-depth decision tree; the algorithm runs in quasi-polynomial time and takes a polynomial number of samples. This can be seen as a reinforcement learning analogue of classic results in computational learning theory. Furthermore, it gives a natural model where improving the sample complexity via representation learning is computationally feasible. 

Speaker Bio: Noah Golowich is a PhD Student at Massachusetts Institute of Technology, advised by Constantinos Daskalakis and Ankur Moitra. His research interests are in theoretical machine learning, with particular focus on the connections between multi-agent learning, game theory, and online learning, and on the theory of reinforcement learning. He is supported by a Fannie & John Hertz Foundation Fellowship and an NSF Graduate Fellowship, and was supported by an MIT Akamai Fellowship.

Tuesday 27th February 2024


Speaker: Caio Kalil Lauand (University of Florida)

Title: The Curse of Memory in Stochastic Approximation

Paper: https://arxiv.org/abs/2309.02944

Slides: here 


More details:

Authors: Caio Kalil Lauand, Sean Meyn

Abstract: Theory and application of stochastic approximation (SA) has grown within the control systems community since the earliest days of adaptive control. This paper takes a new look at the topic, motivated by recent results establishing remarkable performance of SA with (sufficiently small) constant step-size α>0. If averaging is implemented to obtain the final parameter estimate, then the estimates are asymptotically unbiased with nearly optimal asymptotic covariance. These results have been obtained for random linear SA recursions with i.i.d. coefficients. This paper obtains very different conclusions in the more common case of geometrically ergodic Markovian disturbance: (i) The target bias is identified, even in the case of non-linear SA, and is in general non-zero. The remaining results are established for linear SA recursions: (ii) the bivariate parameter-disturbance process is geometrically ergodic in a topological sense; (iii) the representation for bias has a simpler form in this case, and cannot be expected to be zero if there is multiplicative noise; (iv) the asymptotic covariance of the averaged parameters is within O(α) of optimal. The error term is identified, and may be massive if mean dynamics are not well conditioned. The theory is illustrated with application to TD-learning. 
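
A minimal numerical illustration of the phenomenon described above: a scalar linear SA recursion driven by a slowly mixing two-state Markov chain, run with a small constant step size and Polyak-Ruppert averaging. With i.i.d. coefficients the averaged iterate would be asymptotically unbiased; with Markovian, multiplicative noise as below, a non-vanishing bias of order α can remain. All constants are illustrative.

import numpy as np

def linear_sa_with_markov_noise(alpha=0.05, n_steps=200_000, seed=0):
    """Constant step-size linear SA, theta <- theta + alpha*(A(X_n)*theta + b(X_n)),
    driven by a two-state Markov chain, with Polyak-Ruppert averaging (sketch)."""
    rng = np.random.default_rng(seed)
    P = np.array([[0.95, 0.05],      # slowly mixing two-state chain
                  [0.05, 0.95]])
    A = np.array([-1.5, -0.5])       # multiplicative noise; stationary mean is -1
    b = np.array([2.0, 0.0])         # stationary mean is 1, so the target is theta* = 1
    x, theta, theta_bar = 0, 0.0, 0.0
    for n in range(1, n_steps + 1):
        theta += alpha * (A[x] * theta + b[x])
        theta_bar += (theta - theta_bar) / n   # running Polyak-Ruppert average
        x = rng.choice(2, p=P[x])
    return theta_bar                 # compare with theta* = 1 to see the bias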

Speaker Bio: Caio Kalil Lauand received the B.S. degree in electrical engineering from the University of North Florida and the M.S. degree in electrical and computer engineering from the University of Florida. He is now a Ph.D. student at the University of Florida under the supervision of Prof. Sean Meyn. His focus is on stochastic approximation and applications such as optimization and reinforcement learning.

Tuesday 20th February 2024


Speaker: Tor Lattimore (DeepMind)

Title: A Lower Bound for Linear and Kernel Regression with Adaptive Covariates

Paper: https://proceedings.mlr.press/v195/lattimore23b.html

Slides: here


More details:

Authors: Tor Lattimore

Abstract: We prove that the continuous time version of the concentration bounds by Abbasi-Yadkori et al. (2011) for adaptive linear regression cannot be improved in general, showing that there can be a significant price for sequential design. This resolves the continuous time version of the COLT open problem by Vakili et al. (2021b) on confidence intervals for kernel regression with sequential designs. Experimental evidence suggests that improved confidence bounds are also not possible in discrete time.

Speaker Bio: Tor Lattimore is a research scientist at Google DeepMind in London working on machine learning theory. Together with Csaba Szepesvári, he is the author of the book Bandit Algorithms, which is freely available at banditalgs.org.

Tuesday 13th February 2024


Speaker: Gellért Weisz (DeepMind)

Title: Online RL in Linearly qπ-Realizable MDPs Is as Easy as in Linear MDPs If You Learn What to Ignore

Paper: https://arxiv.org/abs/2310.07811

Slides: here 


More details:


Authors: Gellért Weisz, András György, Csaba Szepesvári

Abstract: We consider online reinforcement learning (RL) in episodic Markov decision processes (MDPs) under the linear qπ-realizability assumption, where it is assumed that the action-values of all policies can be expressed as linear functions of state-action features. This class is known to be more general than linear MDPs, where the transition kernel and the reward function are assumed to be linear functions of the feature vectors. As our first contribution, we show that the difference between the two classes is the presence of states in linearly qπ-realizable MDPs where, for any policy, all the actions have approximately equal values, and that skipping over these states by following an arbitrarily fixed policy in those states transforms the problem into a linear MDP. Based on this observation, we derive a novel (computationally inefficient) learning algorithm for linearly qπ-realizable MDPs that simultaneously learns what states should be skipped over and runs another learning algorithm on the linear MDP hidden in the problem. The method returns an ϵ-optimal policy after polylog(H,d)/ϵ^2 interactions with the MDP, where H is the time horizon and d is the dimension of the feature vectors, giving the first polynomial-sample-complexity online RL algorithm for this setting. The results are proved for the misspecified case, where the sample complexity is shown to degrade gracefully with the misspecification error.

Speaker Bio: Gellért is a Research Scientist at Google DeepMind. His research focuses on reinforcement learning theory. Gellért obtained his PhD at University College London, where he was advised by Csaba Szepesvári.