Past Seminars - Autumn 2021

Tuesday 23rd November 2021, 6pm UTC

Speaker: Yichun Hu (Cornell)

Title: Fast Rates for the Regret of Offline Reinforcement Learning

Paper: https://arxiv.org/abs/2102.00479

Slides: here

More details:

Fast Rates for the Regret of Offline Reinforcement Learning

Authors: Yichun Hu, Nathan Kallus, Masatoshi Uehara

Abstract: We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP). While existing analyses of common approaches, such as fitted Q-iteration (FQI), suggest an O(1/\sqrt{n}) convergence for regret, empirical behavior exhibits much faster convergence. In this paper, we present a finer regret analysis that exactly characterizes this phenomenon by providing fast rates for the regret convergence. First, we show that given any estimate for the optimal quality function Q^*, the regret of the policy it defines converges at a rate given by the exponentiation of the Q^*-estimate's pointwise convergence rate, thus speeding it up. The level of exponentiation depends on the level of noise in the decision-making problem, rather than the estimation problem. We establish such noise levels for linear and tabular MDPs as examples. Second, we provide new analyses of FQI and Bellman residual minimization to establish the correct pointwise convergence guarantees. As specific cases, our results imply O(1/n) regret rates in linear cases and exp(-\Omega(n)) regret rates in tabular cases.
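
The baseline algorithm analyzed in the talk is fitted Q-iteration on a fixed offline dataset. The snippet below is a minimal sketch of offline FQI with linear function approximation on synthetic transitions, included only to make the procedure concrete; the feature map, dataset format, and hyperparameters are illustrative assumptions and not the paper's experimental setup.

```python
import numpy as np

def fitted_q_iteration(data, phi, n_actions, gamma=0.9, n_iters=50, ridge=1e-3):
    """Offline fitted Q-iteration with linear function approximation.

    data: list of (s, a, r, s') transitions collected by a fixed behavior policy.
    phi:  feature map phi(s, a) -> np.ndarray of dimension d.
    Returns the weight vector of the final Q-estimate.
    """
    d = phi(data[0][0], data[0][1]).shape[0]
    theta = np.zeros(d)
    X = np.array([phi(s, a) for s, a, _, _ in data])   # regression inputs (fixed)
    A = X.T @ X + ridge * np.eye(d)                    # reusable ridge Gram matrix
    for _ in range(n_iters):
        # Bellman targets under the current estimate: r + gamma * max_a' Q(s', a')
        y = np.array([
            r + gamma * max(phi(s_next, a2) @ theta for a2 in range(n_actions))
            for _, _, r, s_next in data
        ])
        theta = np.linalg.solve(A, X.T @ y)            # least-squares regression step
    return theta

# Toy usage: 2 states, 2 actions, one-hot features over (state, action) pairs.
phi = lambda s, a: np.eye(4)[2 * s + a]
rng = np.random.default_rng(0)
data = [(s, a, float(s == 1 and a == 0), rng.integers(2))
        for s in (0, 1) for a in (0, 1) for _ in range(25)]
theta_hat = fitted_q_iteration(data, phi, n_actions=2)
greedy = lambda s: int(np.argmax([phi(s, a) @ theta_hat for a in (0, 1)]))
```

The greedy policy of the final Q-estimate is the policy whose regret the paper's analysis characterizes.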

Speaker Bio: Yichun is a PhD candidate in Operations Research at Cornell University, advised by Professor Nathan Kallus. She is broadly interested in data-driven decision making problems, including but not limited to contextual bandits, reinforcement learning and stochastic optimization.

Tuesday 9th November 2021, 6pm UTC

Speaker: Tengyang Xie (UIUC)

Title: Bellman-consistent Pessimism for Offline Reinforcement Learning

Paper: https://arxiv.org/abs/2106.06926

Slides: here

More details:

Bellman-consistent Pessimism for Offline Reinforcement Learning

Authors: Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, Alekh Agarwal

Abstract: The use of pessimism, when reasoning about datasets lacking exhaustive exploration, has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear MDPs where stronger function-approximation assumptions hold, our result improves upon a recent bonus-based approach by O(d) in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
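
For a finite function class and a tabular toy problem, the idea described in the abstract can be written as a brute-force max-min: keep the Q-functions whose empirical Bellman error under a candidate policy is small, and score the policy by the most pessimistic initial-state value among them. The version-space construction below (squared empirical Bellman error with a slack parameter, greedy candidate policies) is a simplified sketch for illustration only; the paper's estimator handles the double-sampling issue and general function approximation far more carefully.

```python
import numpy as np

def bellman_loss(q, policy, data, gamma=0.9):
    """Empirical squared Bellman error of Q-table `q` under `policy` (state -> action map)."""
    return np.mean([
        (q[s, a] - (r + gamma * q[s2, policy[s2]])) ** 2
        for s, a, r, s2 in data
    ])

def pessimistic_policy(F, data, s0, gamma=0.9, slack=0.05):
    """Brute-force Bellman-consistent pessimism over a finite class F of Q-tables.

    Candidate policies are the greedy policies of the functions in F. Each policy is
    scored by the smallest initial-state value among the Q-functions that are
    (approximately) Bellman-consistent with it; the best pessimistic score wins.
    """
    best_policy, best_score = None, -np.inf
    for f in F:
        policy = f.argmax(axis=1)                                 # greedy policy of this candidate
        losses = [bellman_loss(g, policy, data, gamma) for g in F]
        version_space = [g for g, l in zip(F, losses) if l <= min(losses) + slack]
        score = min(g[s0, policy[s0]] for g in version_space)     # pessimism at the initial state
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy

# Toy usage: 2 states, 2 actions, a handful of random Q-tables and offline transitions.
rng = np.random.default_rng(1)
F = [rng.uniform(0, 1, size=(2, 2)) for _ in range(8)]
data = [(int(s), int(a), float(rng.random() < 0.5), int(rng.integers(2)))
        for s in (0, 1) for a in (0, 1) for _ in range(10)]
print(pessimistic_policy(F, data, s0=0))
```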

Speaker Bio: Tengyang is a PhD candidate in computer science at the University of Illinois at Urbana-Champaign, advised by Professor Nan Jiang. His research focuses on reinforcement learning (RL)—especially offline RL, function approximation, and exploration.

Tuesday 2nd November 2021, 5pm UTC

Speaker: Martin J. Wainwright (UC Berkeley)

Title: Optimal policy evaluation using kernel-based temporal difference methods

Paper: https://arxiv.org/abs/2109.12002

Slides: here

More details:

Optimal policy evaluation using kernel-based temporal difference methods

Authors: Yaqi Duan, Mengdi Wang, Martin J. Wainwright

Abstract: We study methods based on reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process (MRP). We study a regularized form of the kernel least-squares temporal difference (LSTD) estimate; in the population limit of infinite data, it corresponds to the fixed point of a projected Bellman operator defined by the associated reproducing kernel Hilbert space. The estimator itself is obtained by computing the projected fixed point induced by a regularized version of the empirical operator; due to the underlying kernel structure, this reduces to solving a linear system involving kernel matrices. We analyze the error of this estimate in the L^2(mu)-norm, where mu denotes the stationary distribution of the underlying Markov chain. Our analysis imposes no assumptions on the transition operator of the Markov chain, but rather only conditions on the reward function and population-level kernel LSTD solutions. We use empirical process theory techniques to derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over sub-classes of MRPs, which show that our rate is optimal in terms of the sample size and the effective horizon H=(1-gamma)^{-1}. Whereas existing worst-case theory predicts cubic scaling (H^3) in the effective horizon, our theory reveals that there is in fact a much wider range of scalings, depending on the kernel, the stationary distribution, and the variance of the Bellman residual error. Notably, it is only parametric and near-parametric problems that can ever achieve the worst-case cubic scaling.
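
As the abstract notes, the estimator reduces to a linear system in kernel matrices. The sketch below uses the sampled states themselves as kernel centers, so the value estimate lives in the span of k(s_i, ·) and the regularized LSTD fixed point becomes a single linear solve; the Gaussian kernel, bandwidth, regularization strength, and toy Markov reward process are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between two sets of states."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def kernel_lstd(states, rewards, next_states, gamma=0.9, reg=1e-2, bandwidth=1.0):
    """Regularized kernel LSTD value estimate for a Markov reward process.

    Uses the sampled states as kernel centers, so the estimate has the form
    V(s) = sum_j alpha_j k(s_j, s) and the projected fixed point is one linear solve.
    Returns a callable value-function estimate.
    """
    K = rbf_kernel(states, states, bandwidth)             # features of the states s_i
    K_next = rbf_kernel(next_states, states, bandwidth)   # features of the next states s'_i
    n = len(states)
    A = K.T @ (K - gamma * K_next) + n * reg * np.eye(n)  # regularized LSTD system
    b = K.T @ rewards
    alpha = np.linalg.solve(A, b)
    return lambda s: rbf_kernel(np.atleast_2d(s), states, bandwidth) @ alpha

# Toy usage: a 1-D chain with contracting Gaussian drift and a quadratic reward.
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 1))
S_next = 0.8 * S + 0.1 * rng.normal(size=(200, 1))
R = (S ** 2).sum(axis=1)
V_hat = kernel_lstd(S, R, S_next)
print(V_hat(np.array([0.5])))
```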

Speaker Bio: Martin Wainwright joined the faculty at the University of California at Berkeley in Fall 2004, and is currently a Chancellor's Professor with a joint appointment between the Department of Statistics and the Department of Electrical Engineering and Computer Sciences. He received his Bachelor's degree in Mathematics from the University of Waterloo, Canada, and his Ph.D. degree in Electrical Engineering and Computer Science (EECS) from the Massachusetts Institute of Technology (MIT), for which he was awarded the George M. Sprowls Prize from the MIT EECS department in 2002. He is interested in high-dimensional statistics, information theory and statistics, and statistical machine learning. He has received an Alfred P. Sloan Foundation Research Fellowship (2005), IEEE Best Paper Awards from the Signal Processing Society (2008) and the Communications Society (2010), the Joint Paper Award from the IEEE Information Theory and Communications Societies (2012), and the COPSS Presidents' Award in Statistics (2014), and was named a Medallion Lecturer (2013) of the Institute of Mathematical Statistics and a Section Lecturer at the International Congress of Mathematicians (2014). He is currently serving as an Associate Editor for the Annals of Statistics, Journal of Machine Learning Research, Journal of the American Statistical Association, and Journal of Information and Inference.

Tuesday 26th October 2021, 5pm UTC

Speaker: Masatoshi Uehara (Cornell)

Title: Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage

Paper: https://arxiv.org/abs/2107.06226

Slides: here

More details:

Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage

Authors: Masatoshi Uehara, Wen Sun

Abstract: We study model-based offline Reinforcement Learning with general function approximation without a full coverage assumption on the offline data distribution. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO), which leverages a general function class and uses a constraint over the model class to encode pessimism. Under the assumption that the ground truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data only providing partial coverage, i.e., it can learn a policy that competes against any policy that is covered by the offline data. We then demonstrate that this algorithmic framework can be applied to many specialized Markov Decision Processes where additional structural assumptions can further refine the concept of partial coverage. Two notable examples are: (1) low-rank MDP with representation learning where the partial coverage condition is defined using a relative condition number measured by the unknown ground truth feature representation; (2) factored MDP where the partial coverage condition is defined using density ratio based concentrability coefficients associated with individual factors.
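
In symbols, the constrained-pessimism construction sketched in the abstract is a max-min problem over a version space of models that fit the offline data D nearly as well as the maximum-likelihood model; the log-likelihood form of the constraint and the slack β below are an illustrative choice of notation, not a quotation from the paper:

\hat{\pi} = \arg\max_{\pi} \min_{M \in \mathcal{M}_\beta} V^{\pi}_{M}, \qquad \mathcal{M}_\beta = \{ M \in \mathcal{M} : \sum_{(s,a,s') \in \mathcal{D}} \log M(s' \mid s,a) \ge \max_{M' \in \mathcal{M}} \sum_{(s,a,s') \in \mathcal{D}} \log M'(s' \mid s,a) - \beta \}.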

Speaker Bio: Masatoshi is a Ph.D. student at Cornell University advised by Nathan Kallus. He also works closely with Wen Sun and Nan Jiang. His research interests are reinforcement learning, causal inference, and their application to digital marketing. Previously, he completed a master's degree in statistics at Harvard and a bachelor's degree in applied mathematics/computer science at the University of Tokyo.

Tuesday 19th October 2021, 5pm UTC

Speaker: Angeliki Kamoutsi (ETH Zurich)

Title: Efficient Performance Bounds for Primal-Dual Reinforcement Learning from Demonstrations

Paper: http://proceedings.mlr.press/v139/kamoutsi21a.html

Slides: here

More details:

Efficient Performance Bounds for Primal-Dual Reinforcement Learning from Demonstrations

Authors: Angeliki Kamoutsi, Goran Banjac, John Lygeros

Abstract: We consider large-scale Markov decision processes with an unknown cost function and address the problem of learning a policy from a finite set of expert demonstrations. We assume that the learner is not allowed to interact with the expert and has no access to a reinforcement signal of any kind. Existing inverse reinforcement learning methods come with strong theoretical guarantees, but are computationally expensive, while state-of-the-art policy optimization algorithms achieve significant empirical success, but are hampered by limited theoretical understanding. To bridge the gap between theory and practice, we introduce a novel bilinear saddle-point framework using Lagrangian duality. The proposed primal-dual viewpoint allows us to develop a model-free provably efficient algorithm through the lens of stochastic convex optimization. The method enjoys the advantages of simplicity of implementation, low memory requirements, and computational and sample complexities independent of the number of states. We further present an equivalent no-regret online-learning interpretation.
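
The abstract reduces learning from demonstrations to a bilinear saddle-point problem solved by a stochastic primal-dual method. The snippet below runs entropic mirror descent-ascent with iterate averaging on a generic bilinear saddle point over probability simplices (a matrix game); it only illustrates the primal-dual optimization machinery, and the payoff matrix, step size, and simplex constraints are stand-ins rather than the paper's occupancy-measure Lagrangian.

```python
import numpy as np

def bilinear_saddle_md(A, n_iters=5000, step=0.05):
    """Entropic mirror descent-ascent for the matrix game min_x max_y x^T A y over simplices.

    Returns the averaged iterates (x_bar, y_bar), which form an approximate saddle point.
    """
    m, n = A.shape
    x = np.ones(m) / m
    y = np.ones(n) / n
    x_sum, y_sum = np.zeros(m), np.zeros(n)
    for _ in range(n_iters):
        gx = A @ y                                   # gradient in x (to be minimized)
        gy = A.T @ x                                 # gradient in y (to be maximized)
        x = x * np.exp(-step * gx); x /= x.sum()     # exponentiated-gradient descent step
        y = y * np.exp(step * gy);  y /= y.sum()     # exponentiated-gradient ascent step
        x_sum += x; y_sum += y
    return x_sum / n_iters, y_sum / n_iters

# Toy usage: matching pennies; the averaged play approaches the uniform equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x_bar, y_bar = bilinear_saddle_md(A)
print(x_bar, y_bar, x_bar @ A @ y_bar)
```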

Speaker Bio: Angeliki Kamoutsi is a PhD student in the Automatic Control Laboratory at ETH Zürich, under the supervision of Prof. John Lygeros. Her research interests lie in reinforcement learning theory. In particular, the overall goal of her PhD work is to combine modern stochastic optimization and statistical learning theory tools for the design of provably efficient algorithms for inverse reinforcement learning. Previously, Angeliki completed a Master's degree in Mathematics at ETH Zurich and a Diploma degree in Applied Mathematics and Physics at the National Technical University of Athens.

Tuesday 5th October 2021, 5pm UTC

Speaker: Andrew Wagenmaker (U. Washington)

Title: Beyond No Regret: Instance-Dependent PAC Reinforcement Learning

Paper: https://arxiv.org/abs/2108.02717

Slides: here

More details:

Beyond No Regret: Instance-Dependent PAC Reinforcement Learning

Authors: Andrew Wagenmaker, Max Simchowitz, Kevin Jamieson

Abstract: The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying ϵ-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an ϵ-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show that this is not possible -- there exists a fundamental tradeoff between achieving low regret and identifying an ϵ-optimal policy at the instance-optimal rate.

Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity -- yielding a complexity which scales with the suboptimality gaps and the "reachability" of a state. We show that our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.

Speaker Bio: Andrew is a fourth-year Ph.D. student in Computer Science at the University of Washington working with Kevin Jamieson. His research interests are in active learning, reinforcement learning, and control theory. Andrew is supported by an NSF Graduate Research Fellowship. Previously, Andrew completed a master's and bachelor's degree at the University of Michigan, both in Electrical Engineering. While at the University of Michigan, he worked with Raj Rao Nadakuditi and Necmiye Ozay.

Tuesday 28th September 2021, 5pm UTC

Speaker: Dongruo Zhou (UCLA)

Title: Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

Paper: https://arxiv.org/abs/2012.08507

Slides: here

More details:

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

Authors: Dongruo Zhou, Quanquan Gu, Csaba Szepesvari

Abstract: We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named UCRL-VTR+ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that UCRL-VTR+ attains an Õ(dH\sqrt{T}) regret where d is the dimension of feature mapping, H is the length of the episode and T is the number of interactions with the MDP. We also prove a matching lower bound Ω(dH\sqrt{T}) for this setting, which shows that UCRL-VTR+ is minimax optimal up to logarithmic factors. In addition, we propose the UCLK+ algorithm for the same family of MDPs under discounting and show that it attains an Õ(d\sqrt{T}/(1−γ)^{1.5}) regret, where γ∈[0,1) is the discount factor. Our upper bound matches the lower bound Ω(d\sqrt{T}/(1−γ)^{1.5}) proved by Zhou et al. (2020) up to logarithmic factors, suggesting that UCLK+ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
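
The estimation step behind this style of algorithm is value-targeted regression: the linear mixture parameter is fit by a ridge regression whose rows are reweighted by estimated variances, which is where the Bernstein-type concentration enters. The sketch below shows only such a variance-weighted ridge solve in isolation; the features X, targets y, variance estimates sigma2, clipping floor, and lam are assumed given and illustrative, and the confidence-set construction and optimistic planning step are omitted.

```python
import numpy as np

def weighted_ridge(X, y, sigma2, lam=1.0):
    """Variance-weighted ridge regression, the estimation step of value-targeted regression.

    X:      (T, d) regression features, one row per observed transition (assumed given).
    y:      (T,)   regression targets, e.g. realized next-state values (assumed given).
    sigma2: (T,)   per-row variance estimates (clipped below to keep the weights bounded).
    Returns the weighted ridge estimate of the mixture parameter theta.
    """
    w = 1.0 / np.maximum(sigma2, 1e-3)             # Bernstein-style reweighting
    d = X.shape[1]
    A = (X * w[:, None]).T @ X + lam * np.eye(d)   # weighted Gram matrix
    b = (X * w[:, None]).T @ y
    return np.linalg.solve(A, b)

# Toy usage with synthetic heteroscedastic data.
rng = np.random.default_rng(0)
theta_star = np.array([0.5, -0.2, 0.7])
X = rng.normal(size=(500, 3))
sigma2 = rng.uniform(0.01, 1.0, size=500)
y = X @ theta_star + np.sqrt(sigma2) * rng.normal(size=500)
print(weighted_ridge(X, y, sigma2))
```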

Speaker Bio: Dongruo is a third-year PhD student in the Department of Computer Science, University of California, Los Angeles, advised by Prof. Quanquan Gu. Previously he obtained his B.Sc. from the Department of Mathematical Sciences, Tsinghua University. After that, Dongruo spent one year in the Department of Systems and Information Engineering (now Engineering Systems and Environment), University of Virginia. Dongruo's research interests are in optimization, reinforcement learning and machine learning. He is interested in designing new stochastic algorithms with provable guarantees, based on classical first-order or high-order deterministic algorithms. He is also interested in understanding basic limitations of different classes of algorithms.

Tuesday 21st September 2021, 5pm UTC

Speaker: Dipendra Misra (Microsoft Research) and Qinghua Liu (Princeton)

Title: Provable Rich Observation Reinforcement Learning with Combinatorial Latent States

Paper: https://openreview.net/forum?id=hx1IXFHAw7R

Slides: here

More details:

Provable Rich Observation Reinforcement Learning with Combinatorial Latent States

Authors: Dipendra Misra, Qinghua Liu, Chi Jin, John Langford

Abstract: We propose a novel setting for reinforcement learning that combines two common real-world difficulties: presence of observations (such as camera images) and factored states (such as location of objects). In our setting, the agent receives observations generated stochastically from a "latent" factored state. These observations are "rich enough" to enable decoding of the latent state and remove partial observability concerns. Since the latent state is combinatorial, the size of the state space is exponential in the number of latent factors. We create a learning algorithm FactoRL (Fact-o-Rel) for this setting, which uses noise-contrastive learning to identify latent structures in emission processes and discover a factorized state space. We derive polynomial sample complexity guarantees for FactoRL which polynomially depend upon the number of factors, and very weakly depend on the size of the observation space. We also provide a guarantee of polynomial time complexity when given access to an efficient planning algorithm.
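
The identification step mentioned in the abstract relies on noise-contrastive learning: train a classifier to tell real transition pairs apart from fake pairs formed by re-pairing observations at random, and read structure off the learned classifier. The sketch below shows only that contrastive-classification step with a plain logistic-regression model on concatenated observations; the factor-discovery and state-decoding machinery of FactoRL is not reproduced here, and the data generation is a synthetic stand-in.

```python
import numpy as np

def contrastive_classifier(obs, next_obs, n_steps=2000, lr=0.1, seed=0):
    """Noise-contrastive training: real (o_t, o_{t+1}) pairs vs. randomly re-paired ones.

    Fits a logistic regression on concatenated pairs and returns (weights, bias);
    score(o, o') = w . [o, o'] + b is high for pairs consistent with the dynamics.
    """
    rng = np.random.default_rng(seed)
    fake_next = next_obs[rng.permutation(len(next_obs))]      # break the temporal pairing
    X = np.vstack([np.hstack([obs, next_obs]),                # label 1: real pairs
                   np.hstack([obs, fake_next])])              # label 0: noise pairs
    y = np.concatenate([np.ones(len(obs)), np.zeros(len(obs))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))                # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)                      # gradient step on the weights
        b -= lr * np.mean(p - y)                              # gradient step on the bias
    return w, b

# Toy usage: 2-D observations whose next value depends on the current one plus noise.
rng = np.random.default_rng(1)
obs = rng.normal(size=(500, 2))
next_obs = 0.9 * obs + 0.1 * rng.normal(size=(500, 2))
w, b = contrastive_classifier(obs, next_obs)
```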

Speaker Bio: Dipendra Misra is a Senior Researcher at Microsoft Research, New York. He received a PhD in computer science from Cornell University (2019) and a bachelor's degree in computer science from the Indian Institute of Technology Kanpur (2013). Dipendra's main research interest is in developing efficient machine learning algorithms with applications to real-world problems, where efficient includes provable, sample- and computationally efficient, interpretable, scalable, and ethical. His empirical focus is on problems in natural language understanding and allied fields. He is currently active in reinforcement learning theory, interactive learning, representation learning, and language and vision problems. Dipendra also has interests in computational social science, and data and society. Qinghua Liu is a PhD student at Princeton University advised by Chi Jin. He is interested in reinforcement learning with function approximation, multi-agent settings, or partial information.

Tuesday 14th September 2021, 5pm UTC

Speaker: Gaurav Mahajan (UCSD)

Title: Bilinear Classes: A Structural Framework for Provable Generalization in RL

Paper: https://arxiv.org/abs/2103.10897

Slides: here

More details:

Bilinear Classes: A Structural Framework for Provable Generalization in RL

Authors: Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, Ruosong Wang

Abstract: This work introduces Bilinear Classes, a new structural framework, which permit generalization in reinforcement learning in a wide variety of settings through the use of function approximation. The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable, and, notably, also includes new models, such as the Linear Q∗/V∗ model in which both the optimal Q-function and the optimal V-function are linear in some known feature space. Our main result provides an RL algorithm which has polynomial sample complexity for Bilinear Classes; notably, this sample complexity is stated in terms of a reduction to the generalization error of an underlying supervised learning sub-problem. These bounds nearly match the best known sample complexity bounds for existing models. Furthermore, this framework also extends to the infinite dimensional (RKHS) setting: for the Linear Q∗/V∗ model, linear MDPs, and linear mixture MDPs, we provide sample complexities that have no explicit dependence on the explicit feature dimension (which could be infinite), but instead depend only on information theoretic quantities.
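
As a concrete reading of the new model highlighted above, the Linear Q∗/V∗ condition can be written as follows for known feature maps φ and ψ (the symbols φ, ψ, w^*, v^* are illustrative notation rather than the paper's):

Q^*(s,a) = \langle \phi(s,a), w^* \rangle \quad \text{and} \quad V^*(s) = \langle \psi(s), v^* \rangle \quad \text{for all } s, a.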

Speaker Bio: Gaurav is a 4th year PhD student in the theory group at UCSD, where he is advised by Sanjoy Dasgupta and Shachar Lovett. Before that, he worked at Microsoft for three years. Gaurav received a BS and MS in Mathematics and Computing from the Indian Institute of Technology, Delhi (IITD), advised by Subiman Kundu. In Summer 2021, he was a research intern working with Sham Kakade and Akshay Krishnamurthy at Microsoft Research, New York. Prior to that, he also worked with Jason Lee at the Institute for Advanced Study and the Simons Institute. Gaurav is broadly interested in questions related to learning and has recently worked on problems related to reinforcement learning, active learning and unsupervised learning.

Tuesday 7th September 2021, 4pm UTC

Speaker: Rui Lu (Tsinghua University)

Title: On the Power of Multitask Representation Learning in Linear MDP

Paper: https://arxiv.org/abs/2106.08053

Slides: here

More details:

On the Power of Multitask Representation Learning in Linear MDP

Authors: Rui Lu, Gao Huang, Simon S. Du

Abstract: While multitask representation learning has become a popular approach in reinforcement learning (RL), theoretical understanding of why and when it works remains limited. This paper presents analyses for the statistical benefit of multitask representation learning in linear Markov Decision Process (MDP) under a generative model. In this paper, we consider an agent that learns a representation function ϕ out of a function class Φ from T source tasks with N data per task, and then uses the learned ϕ̂ to reduce the required number of samples for a new task. We first discover a Least-Activated-Feature-Abundance (LAFA) criterion, denoted as κ, with which we prove that a straightforward least-square algorithm learns a policy which is Õ(H^2\sqrt{C(Φ)^2κd/NT+κd/n}) sub-optimal. Here H is the planning horizon, C(Φ) is Φ's complexity measure, d is the dimension of the representation (usually d≪C(Φ)) and n is the number of samples for the new task. Thus the required n is O(κdH^4) for the sub-optimality to be close to zero, which is much smaller than O(C(Φ)^2κdH^4) in the setting without multitask representation learning, whose sub-optimality gap is Õ(H^2\sqrt{κC(Φ)^2d/n}). This theoretically explains the power of multitask representation learning in reducing sample complexity. Further, we note that to ensure high sample efficiency, the LAFA criterion κ should be small. In fact, κ varies widely in magnitude depending on the sampling distribution for the new task. This indicates that an adaptive sampling technique is important to make κ depend solely on d. Finally, we provide empirical results on a noisy grid-world environment to corroborate our theoretical findings.
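
The "straightforward least-square algorithm" mentioned in the abstract has a two-stage structure: pool the T source tasks to learn the representation, then regress on the new task with the learned features. The regression notation below is a simplified generic stand-in for the paper's generative-model objectives:

\hat{\phi} = \arg\min_{\phi \in \Phi,\, w_1,\dots,w_T} \sum_{t=1}^{T} \sum_{i=1}^{N} \big( y_{t,i} - \langle w_t, \phi(x_{t,i}) \rangle \big)^2, \qquad \hat{w}_{\mathrm{new}} = \arg\min_{w} \sum_{i=1}^{n} \big( y_i - \langle w, \hat{\phi}(x_i) \rangle \big)^2.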

Speaker Bio: Rui Lu is a first-year graduate student advised by Professor Gao Huang in the Department of Automation at Tsinghua University. Their research interests lie broadly in machine learning, including computer vision, reinforcement learning, robustness, and the theoretical understanding of deep learning. Prior to graduate study, they received a bachelor's degree from the Yao Class at Tsinghua University (September 2016 - June 2021).