Past Seminars - Spring 2021

Tuesday 27th July 2021, 5pm UTC

Speaker: Xiuyuan Lu (DeepMind)

Title: Reinforcement Learning, Bit by Bit

Paper: https://arxiv.org/abs/2103.04047

Slides: here

More details:

Reinforcement Learning, Bit by Bit

Authors: Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, Zheng Wen

Abstract: Reinforcement learning agents have demonstrated remarkable achievements in simulated environments. Data efficiency poses an impediment to carrying this success over to real environments. The design of data-efficient agents calls for a deeper understanding of information acquisition and representation. We develop concepts and establish a regret bound that together offer principled guidance. The bound sheds light on questions of what information to seek, how to seek that information, and what information to retain. To illustrate concepts, we design simple agents that build on them and present computational results that demonstrate improvements in data efficiency.

Speaker Bio: Xiuyuan Lu is a research scientist at DeepMind. She is interested in developing data-efficient reinforcement learning agents. Before DeepMind, she graduated from Stanford University with a PhD in Operations Research under the supervision of Professor Benjamin Van Roy.

Tuesday 13th July 2021, 5pm UTC

Speaker: Pierre Ménard (Universität Magdeburg)

Title: Model-Free Learning for Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall

Paper: https://arxiv.org/abs/2106.06279

Slides: here

More details:

Model-Free Learning for Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall

Authors: Tadashi Kozuno, Pierre Ménard, Rémi Munos, Michal Valko

Abstract: We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play. Precisely, we focus on two-player, zero-sum, episodic, tabular IIG under the perfect-recall assumption where the only feedback is realizations of the game (bandit feedback). In particular, the dynamics of the IIG are not known -- we can only access them by sampling or interacting with a game simulator. For this learning setting, we provide the Implicit Exploration Online Mirror Descent (IXOMD) algorithm. It is a model-free algorithm with a high-probability bound on the convergence rate to the NE of order 1/sqrt(T) where T is the number of played games. Moreover, IXOMD is computationally efficient as it needs to perform the updates only along the sampled trajectory.
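
For intuition, here is a minimal sketch of the implicit-exploration (IX) estimator that gives IXOMD its name, written for a plain adversarial bandit rather than the paper's tree-structured game with perfect recall; the function, the loss oracle loss_fn, and the parameter values are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def ix_exponential_weights(loss_fn, n_actions, T, eta=0.1, gamma=0.05):
    """Exponential weights / online mirror descent with implicit-exploration
    (IX) loss estimates, sketched for a toy adversarial bandit. IXOMD applies
    the same kind of biased estimator along sampled trajectories of the game tree."""
    cum_loss = np.zeros(n_actions)            # cumulative estimated losses
    rng = np.random.default_rng(0)
    for t in range(T):
        logits = -eta * (cum_loss - cum_loss.min())
        probs = np.exp(logits) / np.exp(logits).sum()
        a = rng.choice(n_actions, p=probs)
        loss = loss_fn(t, a)                  # bandit feedback for the played action only
        # IX estimator: divide by probs[a] + gamma (not probs[a]) to keep the
        # estimate's variance under control, at the price of a small bias.
        cum_loss[a] += loss / (probs[a] + gamma)
    return probs
```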

Speaker Bio: Pierre Ménard is a postdoctoral researcher at Institut für Mathematische Stochastik in the Otto-von-Guericke-Universität Magdeburg in collaboration with Alexandra Carpentier. His work is mainly focused on bandit and reinforcement learning. Previously he was a postdoctoral researcher in the SequeL team in collaboration with Emilie Kaufmann and Michal Valko. He obtained his PhD from the Université de Toulouse III - Paul Sabatier under the supervision of Aurélien Garivier and Gilles Stoltz.

Tuesday 6th July 2021, 5pm UTC

Speaker: Gellért Weisz (DeepMind / UCL)

Title: On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function

Paper: https://arxiv.org/abs/2102.02049

Slides: here

More details:

On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function

Authors: Gellért Weisz, Philip Amortila, Barnabás Janzer, Yasin Abbasi-Yadkori, Nan Jiang, Csaba Szepesvári

Abstract: We consider the problem of local planning in fixed-horizon Markov Decision Processes (MDPs) with a generative model under the assumption that the optimal value function lies in the span of a feature map that is accessible through the generative model. As opposed to previous work (e.g. Lattimore et al. (2020)) where linear realizability of all policies was assumed, we consider the significantly relaxed assumption of a single linearly realizable (deterministic) policy. A recent lower bound by Weisz et al. (2020) established that the related problem when the action-value function of the optimal policy is linearly realizable requires an exponential number of queries, either in H (the horizon of the MDP) or d (the dimension of the feature mapping). Their construction crucially relies on having an exponentially large action set. In contrast, in this work, we establish that poly(H, d) learning is possible (with state value function realizability) whenever the action set is small (i.e. O(1)). In particular, we present the TENSORPLAN algorithm which uses poly((dH/δ)^A) queries to find a δ-optimal policy relative to any deterministic policy for which the value function is linearly realizable with a parameter from a fixed radius ball around zero. This is the first algorithm to give a polynomial query complexity guarantee using only linear-realizability of a single competing value function. Whether the computation cost is similarly bounded remains an interesting open question. The upper bound is complemented by a lower bound which proves that in the infinite-horizon episodic setting, planners that achieve constant suboptimality need exponentially many queries, either in the dimension or the number of actions.
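
For concreteness, the linear realizability assumption on the optimal state-value function discussed above can be written schematically (with \phi the d-dimensional feature map queried through the generative model and B the radius of the parameter ball) as

v^*(s) = \langle \theta^*, \phi(s) \rangle \;\; \text{for all states } s, \quad \text{for some } \theta^* \in \mathbb{R}^d \text{ with } \|\theta^*\|_2 \le B.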

Speaker Bio: Gellért Weisz is a Ph.D. candidate at UCL/DeepMind, under the supervision of Prof. Csaba Szepesvári. He received his master's degree in machine learning from the University of Cambridge. His research interests lie in Reinforcement Learning theory. He is currently focused on the feasibility of efficient RL in the presence of low-dimensional linear structure.

Tuesday 29th June 2021, 5pm UTC

Speaker: Jean Tarbouriech (FAIR and INRIA Scool)

Title: Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

Paper: https://arxiv.org/abs/2104.11186

Slides: here

More details:

Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

Authors: Jean Tarbouriech, Runlong Zhou, Simon S. Du, Matteo Pirotta, Michal Valko, Alessandro Lazaric

Abstract: We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to guarantee both optimism and convergence of the associated value iteration scheme. We prove that EB-SSP achieves the minimax regret rate O˜(B*sqrt{SAK}), where K is the number of episodes, S is the number of states, A is the number of actions and B* bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of B*, nor of T* which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of T* is available) where the regret only contains a logarithmic dependence on T*, thus yielding the first horizon-free regret bound beyond the finite-horizon MDP setting.
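
As a rough illustration of the bonus-perturbed value iteration described above (not the exact EB-SSP update, which in addition slightly skews the empirical transitions), one could write something like the sketch below; the names, the clipping of costs at zero, and the stopping rule are assumptions of this sketch.

```python
import numpy as np

def bonus_perturbed_ssp_vi(P_hat, c_hat, bonus, goal, n_iters=10_000, tol=1e-8):
    """Value iteration for a stochastic shortest path instance with costs
    lowered by an exploration bonus (an optimism mechanism in the spirit of
    the abstract; EB-SSP itself also skews the empirical transition kernel).

    P_hat: (S, A, S) empirical transition probabilities
    c_hat: (S, A) empirical costs in [0, 1]
    bonus: (S, A) exploration bonuses
    goal:  index of the absorbing, cost-free goal state
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = np.maximum(c_hat - bonus, 0.0) + P_hat @ V   # optimistic Q, shape (S, A)
        V_new = Q.min(axis=1)
        V_new[goal] = 0.0                                # reaching the goal costs nothing
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```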

Speaker Bio: Jean is a PhD student at Facebook AI Research Paris and Inria Lille (Scool team), advised by Alessandro Lazaric and Michal Valko. Prior to that, he received an MSc in Applied Mathematics from Ecole Polytechnique and an MSc in Machine Learning from ENS Paris-Saclay. His main research interest relates to exploration in unsupervised RL and goal-oriented RL.

Tuesday 22nd June 2021, 5pm UTC

Speaker: Anastasios Tsiamis (U. Pennsylvania)

Title: Linear Systems can be Hard to Learn

Paper: https://arxiv.org/abs/2104.01120

Slides: here

More details:

Linear Systems can be Hard to Learn

Authors: Anastasios Tsiamis, George J. Pappas

Abstract: In this paper, we investigate when system identification is statistically easy or hard, in the finite sample regime. Statistically easy to learn linear system classes have sample complexity that is polynomial with the system dimension. Most prior research in the finite sample regime falls in this category, focusing on systems that are directly excited by process noise. Statistically hard to learn linear system classes have worst-case sample complexity that is at least exponential with the system dimension, regardless of the identification algorithm. Using tools from minimax theory, we show that classes of linear systems can be hard to learn. Such classes include, for example, under-actuated or under-excited systems with weak coupling among the states. Having classified some systems as easy or hard to learn, a natural question arises as to what system properties fundamentally affect the hardness of system identifiability. Towards this direction, we characterize how the controllability index of linear systems affects the sample complexity of identification. More specifically, we show that the sample complexity of robustly controllable linear systems is upper bounded by an exponential function of the controllability index. This implies that identification is easy for classes of linear systems with small controllability index and potentially hard if the controllability index is large. Our analysis is based on recent statistical tools for finite sample analysis of system identification as well as a novel lower bound that relates controllability index with the least singular value of the controllability Gramian.
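
For reference, the quantities mentioned above are the standard ones from linear systems theory: for x_{t+1} = A x_t + B u_t + w_t with state dimension n, the k-step controllability matrix, the controllability index, and the controllability Gramian are

\mathcal{C}_k = [B \;\; AB \;\; \cdots \;\; A^{k-1}B], \qquad \kappa = \min\{k : \mathrm{rank}(\mathcal{C}_k) = n\}, \qquad W_k = \sum_{t=0}^{k-1} A^t B B^\top (A^\top)^t,

and the paper's lower bound ties \kappa to the least singular value of this Gramian.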

Speaker Bio: Anastasios Tsiamis is a PhD student in the ESE department at the University of Pennsylvania, where he is advised by George Pappas. He received the Diploma degree in Electrical and Computer Engineering from the National Technical University of Athens, Greece. His research interests lie in the intersection of control theory and machine learning. In particular, his recent work focuses on understanding how fundamental control theoretic properties affect the difficulty of learning in system identification, online estimation and control. Previously, he worked on risk-aware control design and control systems security and privacy.


Tuesday 15th June 2021, 5pm UTC

Speaker: Paria Rashidinejad (UC Berkeley)

Title: Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism

Paper: https://arxiv.org/abs/2103.12021

Slides: here

More details:

Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism

Authors: Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, Stuart Russell

Abstract: Offline (or batch) reinforcement learning (RL) algorithms seek to learn an optimal policy from a fixed dataset without active data collection. Based on the composition of the offline dataset, two main categories of methods are used: imitation learning which is suitable for expert datasets and vanilla offline RL which often requires uniform coverage datasets. From a practical standpoint, datasets often deviate from these two extremes and the exact data composition is usually unknown a priori. To bridge this gap, we present a new offline RL framework that smoothly interpolates between the two extremes of data composition, hence unifying imitation learning and vanilla offline RL. The new framework is centered around a weak version of the concentrability coefficient that measures the deviation from the behavior policy to the expert policy alone.

Under this new framework, we further investigate the question of algorithm design: can one develop an algorithm that achieves a minimax optimal rate and also adapts to unknown data composition? To address this question, we consider a lower confidence bound (LCB) algorithm developed based on pessimism in the face of uncertainty in offline RL. We study finite-sample properties of LCB as well as information-theoretic limits in multi-armed bandits, contextual bandits, and Markov decision processes (MDPs). Our analysis reveals surprising facts about optimality rates. In particular, in all three settings, LCB achieves a faster rate of 1/N for nearly-expert datasets compared to the usual rate of 1/sqrt{N} in offline RL, where N is the number of samples in the batch dataset. In the case of contextual bandits with at least two contexts, we prove that LCB is adaptively optimal for the entire data composition range, achieving a smooth transition from imitation learning to offline RL. We further show that LCB is almost adaptively optimal in MDPs.
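
To make the pessimism principle concrete in the simplest setting analysed in the paper (offline multi-armed bandits), a lower-confidence-bound selection rule looks roughly like the sketch below; the Hoeffding-style width and its constants are generic assumptions, not the paper's exact choices.

```python
import numpy as np

def lcb_choose_arm(rewards_by_arm, delta=0.05):
    """Pick the arm with the largest lower confidence bound computed from a
    fixed batch dataset (pessimism: penalise arms we have little data on).

    rewards_by_arm: list of 1-D arrays of observed rewards in [0, 1], one per arm
    """
    n_arms = len(rewards_by_arm)
    lcb = np.full(n_arms, -np.inf)            # arms with no data are never chosen
    for a, r in enumerate(rewards_by_arm):
        if len(r) > 0:
            width = np.sqrt(np.log(2 * n_arms / delta) / (2 * len(r)))
            lcb[a] = r.mean() - width
    return int(np.argmax(lcb))
```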

Speaker Bio: Paria Rashidinejad is a Ph.D. candidate in EECS at UC Berkeley, working with Prof. Stuart Russell and Prof. Jiantao Jiao. She received her bachelor's degree in electrical engineering from Sharif University of Technology. Her research interests lie in the areas of machine learning, statistics, and optimization. She is currently focused on the theory and application of reinforcement learning and developing efficient algorithms for inference and decision making in dynamical systems.

Tuesday 8th June 2021, 5pm UTC

Speaker: Yuanhao Wang (Princeton)

Title: An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap

Paper: https://arxiv.org/abs/2103.12690

Slides: here

More details:

An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap

Authors: Yuanhao Wang, Ruosong Wang, Sham M. Kakade

Abstract: A fundamental question in the theory of reinforcement learning is: suppose the optimal Q-function lies in the linear span of a given d dimensional feature mapping, is sample-efficient reinforcement learning (RL) possible? The recent and remarkable result of Weisz et al. (2020) resolved this question in the negative, providing an exponential (in d) sample size lower bound, which holds even if the agent has access to a generative model of the environment. One may hope that this information theoretic barrier for RL can be circumvented by further supposing an even more favorable assumption: there exists a constant suboptimality gap between the optimal Q-value of the best action and that of the second-best action (for all states). The hope is that having a large suboptimality gap would permit easier identification of optimal actions themselves, thus making the problem tractable; indeed, provided the agent has access to a generative model, sample-efficient RL is in fact possible with the addition of this more favorable assumption.

This work focuses on this question in the standard online reinforcement learning setting, where our main result resolves this question in the negative: our hardness result shows that an exponential sample complexity lower bound still holds even if a constant suboptimality gap is assumed in addition to having a linearly realizable optimal Q-function. Perhaps surprisingly, this implies an exponential separation between the online RL setting and the generative model setting. Complementing our negative hardness result, we give two positive results showing that provably sample-efficient RL is possible either under an additional low-variance assumption or under a novel hypercontractivity assumption (both implicitly place stronger conditions on the underlying dynamics model).
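
Written out, the constant suboptimality gap assumption referred to above says there is a constant \Delta > 0 such that for every state s and every action a that is not optimal at s,

\max_{a'} Q^*(s, a') - Q^*(s, a) \ge \Delta.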

Speaker Bio: Yuanhao is a first-year PhD student in the CS department at Princeton University. Before that, he completed his undergraduate studies at the Institute for Interdisciplinary Information Sciences, Tsinghua University (a.k.a. Andrew Yao’s CS pilot class). His research interests include reinforcement learning and minimax optimization.

Tuesday 1st June 2021, 5pm UTC

Speaker: Aviv Rosenberg (Tel-Aviv University)

Title: Minimax Regret for Stochastic Shortest Path

Paper: https://arxiv.org/abs/2103.13056

Slides: here

More details:

Minimax Regret for Stochastic Shortest Path

Authors: Alon Cohen, Yonathan Efroni, Yishay Mansour and Aviv Rosenberg

Abstract: We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge about the costs and dynamics of the model. She repeatedly interacts with the model for K episodes, and has to learn to approximate the optimal policy as closely as possible. In this work we show that the minimax regret for this setting is O(B* \sqrt{|S||A|K}) where B* is a bound on the expected cost of the optimal policy from any state, S is the state space, and A is the action space. This matches the lower bound of Rosenberg et al. (2020) up to logarithmic factors, and improves their regret bound by a factor of \sqrt{|S|}. Our algorithm runs in polynomial-time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs. To that end, we provide an algorithm for the finite-horizon setting whose leading term in the regret depends only logarithmically on the horizon, yielding the same regret guarantees for SSP.
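
For concreteness, the regret being bounded here is the standard SSP notion: the total cost accumulated over the K episodes compared with K times the optimal expected cost-to-goal from the initial state,

R_K = \sum_{k=1}^{K} \sum_{i=1}^{I^k} c^k_i \; - \; K \cdot V^*(s_{\mathrm{init}}),

where I^k is the number of steps taken in episode k and c^k_i the cost incurred at step i.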

Speaker Bio: Aviv Rosenberg is a PhD candidate in the department of computer science at Tel Aviv University, advised by Prof. Yishay Mansour. Prior to that, he received his Bachelor's degree in Mathematics and Computer Science at Tel Aviv University. His main research interest is the theoretical foundations of sequential decision making, and his research lies in the intersection between reinforcement learning, online learning and optimization. In particular, he is interested in reinforcement learning with robust theoretical guarantees.

Tuesday 18th May 2021, 5pm UTC

Speaker: Jeongyeol Kwon (UT Austin)

Title: RL for Latent MDPs: Regret Guarantees and a Lower Bound

Paper: https://arxiv.org/abs/2102.04939

Slides: here

More details:

RL for Latent MDPs: Regret Guarantees and a Lower Bound

Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor

Abstract: In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDP). In an LMDP, an MDP is randomly drawn from a set of M possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least \Omega((SA)^M) episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires a polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with a local guarantee, i.e., providing a sublinear regret guarantee when we are given a good initialization. Finally, if we are given standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., Boots et al.) and a reachability assumption, we show that the need for initialization can be removed.
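
Schematically, in an LMDP with mixing weights w_1, ..., w_M over the M component MDPs (sharing state and action spaces), a history-dependent policy \pi is judged by its average value, so the agent must do well in expectation over the latent MDP it cannot observe:

V(\pi) = \sum_{m=1}^{M} w_m \, V^{\pi}_{m},

where V^{\pi}_{m} is the value of \pi in the m-th MDP.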

Speaker Bio: Jeongyeol is a 4th year PhD candidate in the ECE department at The University of Texas at Austin. His research interests center on robust decision-making in large-scale complex systems, with a focus on theoretical aspects of learning and computation. He has been particularly interested in efficient and practical algorithms for learning mixture and latent variable problems. He has been fascinated by the remarkable EM algorithm, and he has studied various aspects of its performance, including its sample and minimax optimality for learning Gaussian mixtures and mixed linear regression. Jeongyeol is currently focusing on developing efficient reinforcement learning algorithms in partially observable domains. Jeongyeol is grateful for support from the Kwanjeong Educational Foundation Scholarship, and from the UT Continuing Graduate Fellowship.

Tuesday 11th May 2021, 5pm UTC

Speaker: Aymen Al Marjani (ENS Lyon)

Title: Adaptive Sampling for Best Policy Identification in Markov Decision Processes

Paper: https://arxiv.org/abs/2009.13405

Slides: here

More details:

Adaptive Sampling for Best Policy Identification in Markov Decision Processes

Authors: Aymen Al Marjani, Alexandre Proutiere

Abstract: We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model. The objective is to devise a learning algorithm returning the best policy as early as possible. We first derive a problem-specific lower bound of the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program, and hence, is hard to exploit in the design of efficient algorithms. We then provide a simple and tight upper bound of the sample complexity lower bound, whose corresponding nearly-optimal sample allocation becomes explicit. The upper bound depends on specific functionals of the MDP such as the sub-optimality gaps and the variance of the next-state value function, and thus really captures the hardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic guarantees for its sample complexity (both almost surely and in expectation). The advantages of KLB-TS against state-of-the-art algorithms are discussed and illustrated numerically.

Speaker Bio: Aymen is a first year PhD student at ENS Lyon, supervised by Aurélien Garivier and Emilie Kaufmann. Before that, he obtained an MSc in Mathematics, Vision and Learning (MVA) from ENS Paris-Saclay with a double degree in statistics from Ecole Polytechnique. Aymen also spent 6 months at KTH, where he wrote his master's thesis under the supervision of Alexandre Proutiere. Aymen's research interests include Reinforcement Learning, Bandits and sequential Statistics.

Tuesday 4th May 2021, 5pm UTC

Speaker: Kefan Dong (Stanford)

Title: Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature

Paper: https://arxiv.org/abs/2102.04168

Slides: here

More details:

Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature

Authors: Kefan Dong, Jiaqi Yang, Tengyu Ma

Abstract: This paper studies model-based bandit and reinforcement learning (RL) with nonlinear function approximations. We propose to study convergence to approximate local maxima because we show that global convergence is statistically intractable even for one-layer neural net bandit with a deterministic reward. For both nonlinear bandit and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOL), which provably converges to a local maximum with sample complexity that only depends on the sequential Rademacher complexity of the model class. Our results imply novel global or local regret bounds on several concrete settings such as linear bandit with finite or sparse model class, and two-layer neural net bandit. A key algorithmic insight is that optimism may lead to over-exploration even for two-layer neural net model class. On the other hand, for convergence to local maxima, it suffices to maximize the virtual return if the model can also reasonably predict the size of the gradient and Hessian of the real return.

Speaker Bio: Kefan Dong is a first-year Ph.D. student at Computer Science Department, Stanford University, advised by Professor Tengyu Ma. Before that, he studied as an undergraduate at Institute for Interdisciplinary Information Sciences, Tsinghua University (a.k.a. Andrew Yao’s CS pilot class). His research interests are reinforcement learning and online learning.

Tuesday 27th April 2021, 5pm UTC

Speaker: Lin Chen (UC Berkeley)

Title: Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

Paper: https://arxiv.org/abs/2103.09847

Slides: here

More details:

Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

Authors: Lin Chen, Bruno Scherrer, Peter L. Bartlett

Abstract: In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime d\gamma^2 > 1, where d is the dimension of the feature vector and \gamma is the discount rate. In this regime, for any q \in [\gamma^2, 1], we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is q/d, and it requires \Omega(\frac{d}{\gamma^2(q-\gamma^2)\epsilon^2}\exp(\Theta(d\gamma^2))) samples to approximate the value function up to an additive error \epsilon. Note that the lower bound of the sample complexity is exponential in d. If q = \gamma^2, even infinite data cannot suffice. Under the low distribution shift assumption, we show that there is an algorithm that needs at most O(\max\{\frac{||\theta^{\pi}||_2^4}{\epsilon^4}\log\frac{d}{\delta}, \frac{1}{\epsilon^2}(d+\log\frac{1}{\delta})\}) samples (\theta^{\pi} is the parameter of the policy in linear function approximation) and guarantees approximation to the value function up to an additive error of \epsilon with probability at least 1-\delta.

Speaker bio: Lin Chen is a postdoctoral scholar at the Simons Institute for the Theory of Computing, University of California, Berkeley. Prior to joining Berkeley, he completed his PhD at Yale University in 2020. His research interests focus on machine learning theory.

Tuesday 20th April 2021, 5pm UTC

Speaker: Tengyu Xu (Ohio State)

Title: A Primal Approach to Constrained Policy Optimization: Global Optimality and Finite-Time Analysis

Paper: https://arxiv.org/abs/2011.05869

Slides: here

More details:

A Primal Approach to Constrained Policy Optimization: Global Optimality and Finite-Time Analysis

Authors: Tengyu Xu, Yingbin Liang, Guanghui Lan

Abstract: Safe reinforcement learning (SRL) problems are typically modeled as constrained Markov Decision Processes (CMDP), in which an agent explores the environment to maximize the expected total reward and meanwhile avoids violating certain constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly to provide a globally optimal policy. Many popular SRL algorithms adopt a primal-dual structure which utilizes the updating of dual variables for satisfying the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which updates the policy alternatingly between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework to solve SRL problems, where each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an O(1/sqrt(T)) convergence rate to the global optimal policy in the constrained policy set and an O(1/sqrt(T)) error bound on constraint satisfaction. This is the first finite-time analysis of SRL algorithms with a global optimality guarantee. Our empirical results demonstrate that CRPO can outperform the existing primal-dual baseline algorithms significantly.
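
A bare-bones sketch of the alternating update described above might look as follows; plain gradient steps stand in for the paper's natural policy gradient update, and the tolerance rule and the choice of which violated constraint to rectify are assumptions of this sketch, not the paper's exact specification.

```python
import numpy as np

def crpo_step(theta, grad_objective, grad_constraints, constraint_values,
              limits, tol=0.05, lr=0.01):
    """One CRPO-style primal update: improve the objective if all estimated
    constraints are (approximately) satisfied, otherwise take a step that
    rectifies a violated constraint.

    grad_objective:    callable theta -> gradient of the expected reward
    grad_constraints:  list of callables theta -> gradient of each expected cost
    constraint_values: current estimates of the expected costs
    limits:            the corresponding cost budgets
    """
    violated = [i for i, (c, b) in enumerate(zip(constraint_values, limits))
                if c > b + tol]
    if not violated:
        return theta + lr * grad_objective(theta)         # objective improvement
    i = violated[0]
    return theta - lr * grad_constraints[i](theta)        # constraint rectification
```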

Speaker bio: Tengyu is a fourth year PhD student at the Department of Electrical and Computer Engineering, Ohio State University, advised by Yingbin Liang. Prior to OSU, he earned his B.Eng. in ME from Xi’an Jiaotong University. His research interests lie in the theoretical foundations of reinforcement learning and large-scale stochastic optimization. His current focus is on developing sample-efficient reinforcement learning algorithms in various settings. In addition, he is also interested in the applications of reinforcement learning in autoML and NLP.

Tuesday 6th April 2021, 5pm UTC

Speaker: Ying Jin (Stanford)

Title: Is Pessimism Provably Efficient for Offline RL?

Paper: https://arxiv.org/abs/2012.15085

Slides:

More details:

Is Pessimism Provably Efficient for Offline RL?

Authors: Ying Jin, Zhuoran Yang, Zhaoran Wang

Abstract: We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators.

Without assuming the sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the dataset, the learned policy serves as the "best effort" among all policies, as no other policies can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which emerges from the "irrelevant" trajectories that are less covered by the dataset and not informative for the optimal policy.
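
As a tabular, finite-horizon caricature of the pessimistic value iteration just described (the paper states PEVI for general function approximators and instantiates the uncertainty quantifier for linear MDPs), one might compute the following; the clipping range and the shape of the penalty are assumptions of this sketch.

```python
import numpy as np

def pessimistic_value_iteration(P_hat, r_hat, penalty, H):
    """Finite-horizon value iteration with an uncertainty penalty subtracted
    from the estimated rewards -- the sign-flipped bonus idea behind PEVI.

    P_hat:   (S, A, S) estimated transition probabilities
    r_hat:   (S, A) estimated mean rewards in [0, 1]
    penalty: (S, A) uncertainty quantifier (e.g., proportional to 1/sqrt(count))
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r_hat - penalty + P_hat @ V        # pessimistic Q-estimate at step h
        Q = np.clip(Q, 0.0, H - h)             # values stay in the feasible range
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V
```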

Speaker bio: Ying is a second-year PhD student in the Department of Statistics, Stanford University. Prior to Stanford, she earned her BS in Mathematics and a double degree in Economics (Finance), both from Tsinghua University, China. Her research interests include Causal Inference, Reinforcement/Policy Learning and Graphs and Networks.

Tuesday 30th March 2021, 5pm UTC

Speaker: Aditya Modi (U. Michigan)

Title: Model-free Representation Learning and Exploration in Low-rank MDPs

Paper: https://arxiv.org/abs/2102.07035

Slides: here

More details:

Model-free Representation Learning and Exploration in Low-rank MDPs

Authors: Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal

Abstract: The low rank MDP has emerged as an important model for studying representation learning and exploration in reinforcement learning. With a known representation, several model-free exploration strategies exist. In contrast, all algorithms for the unknown representation setting are model-based, thereby requiring the ability to model the full dynamics. In this work, we present the first model-free representation learning algorithms for low rank MDPs. The key algorithmic contribution is a new minimax representation learning objective, for which we provide variants with differing tradeoffs in their statistical and computational properties. We interleave this representation learning step with an exploration strategy to cover the state space in a reward-free manner. The resulting algorithms are provably sample efficient and can accommodate general function approximation to scale to complex environments.

Speaker bio: Aditya is a PhD candidate in the Computer Science and Engineering department at the University of Michigan, Ann Arbor, advised by Satinder Singh and Ambuj Tewari. Before coming to Michigan, Aditya completed his undergraduate degree in Computer Science at the Indian Institute of Technology, Kanpur in May 2016. His research interests lie in developing methods with provable guarantees for interactive learning and sequential decision making frameworks like reinforcement learning, bandits and (general) online learning. His current focus is on the sample efficiency of reinforcement learning methods under various structural assumptions. Additionally, Aditya has an active interest in real-world applications of reinforcement learning methods.

Tuesday 23rd March 2021, 5pm UTC

Speaker: Qinghua Liu (Princeton)

Title: Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms

Paper: https://arxiv.org/abs/2102.00815

Slides: here

More details:

Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms

Authors: Chi Jin, Qinghua Liu, Sobhan Miryoosefi

Abstract: Finding the minimal structural assumptions that empower sample-efficient learning is one of the most important research directions in Reinforcement Learning (RL). This paper advances our understanding of this fundamental question by introducing a new complexity measure -- Bellman Eluder (BE) dimension. We show that the family of RL problems of low BE dimension is remarkably rich, which subsumes a vast majority of existing tractable RL problems including but not limited to tabular MDPs, linear MDPs, reactive POMDPs, low Bellman rank problems as well as low Eluder dimension problems. This paper further designs a new optimization-based algorithm -- GOLF, and reanalyzes a hypothesis elimination-based algorithm -- OLIVE (proposed in Jiang et al. (2017)). We prove that both algorithms learn the near-optimal policies of low BE dimension problems in a number of samples that is polynomial in all relevant parameters, but independent of the size of state-action space. Our regret and sample complexity results match or improve the best existing results for several well-known subclasses of low BE dimension problems.

Speaker bio: Qinghua is a third-year PhD student at Princeton University advised by Chi Jin. He is interested in reinforcement learning with function approximation, multi-agents or partial information.

Tuesday 16th March 2021, 5pm UTC

Speaker: Yi Su (Cornell)

Title: Adaptive Estimator Selection for Off-Policy Evaluation

Paper: https://arxiv.org/abs/2002.07729

Slides: here

More details:

Adaptive Estimator Selection for Off-Policy Evaluation

Authors: Yi Su, Pavithra Srinath, Akshay Krishnamurthy

Abstract: We develop a generic data-driven method for estimator selection in off-policy policy evaluation settings. We establish a strong performance guarantee for the method, showing that it is competitive with the oracle estimator, up to a constant factor. Via in-depth case studies in contextual bandits and reinforcement learning, we demonstrate the generality and applicability of the method. We also perform comprehensive experiments, demonstrating the empirical efficacy of our approach and comparing with related approaches. In both case studies, our method compares favorably with existing methods.
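
The data-driven selection is in the spirit of Lepski-style procedures; a generic sketch (not the paper's exact rule, its ordering conditions, or its constants) is below, assuming the candidate estimators come ordered so that their confidence widths shrink while their unknown bias may grow.

```python
import numpy as np

def lepski_style_select(estimates, widths):
    """Return the index of the last estimator whose confidence interval still
    intersects the intervals of all estimators before it (a generic
    Lepski-style rule for trading off variance against unknown bias).

    estimates: point estimates, ordered by decreasing confidence width
    widths:    matching confidence-interval half-widths
    """
    estimates, widths = np.asarray(estimates), np.asarray(widths)
    chosen = 0
    for i in range(len(estimates)):
        lo, hi = estimates[i] - widths[i], estimates[i] + widths[i]
        overlaps_all = all(lo <= estimates[j] + widths[j] and
                           hi >= estimates[j] - widths[j] for j in range(i))
        if not overlaps_all:
            break
        chosen = i
    return chosen
```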

Speaker bio: Yi is a PhD student in the Department of Statistics and Data Science at Cornell University, advised by Professor Thorsten Joachims and supported by a Bloomberg Data Science Ph.D. Fellowship. Previously, Yi received her BSc (Honors) in Mathematics from Nanyang Technological University in Singapore. During her undergraduate years, Yi spent a summer at the University of California, Berkeley and a year at the Department of Statistics and Applied Probability, National University of Singapore. In Spring 2019, Yi worked as a Research Intern in the Machine Learning Group at Microsoft Research NYC, advised by Miro Dudik and Akshay Krishnamurthy. In Summer 2020, Yi worked as a Research Intern at Bloomberg AI. Yi is interested in machine learning methods, algorithms, and systems. Specifically, she is interested in learning from user behavioral data and implicit feedback in search engines, recommender systems and market platforms. Yi's current interest lies in off-policy evaluation and learning in contextual bandits and reinforcement learning.

Tuesday 9th March 2021, 6pm UTC

Speaker: Ruosong Wang (CMU)

Title: What are the Statistical Limits of Offline RL with Linear Function Approximation?

Paper: https://arxiv.org/abs/2010.11895

Slides: here

More details:

What are the Statistical Limits of Offline RL with Linear Function Approximation?

Authors: Ruosong Wang, Dean P. Foster, Sham M. Kakade

Abstract: Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, where the literature largely consists of sufficient conditions.

This work focuses on the basic question of what are necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if: 1) we have realizability in that the true value function of every policy is linear in a given set of features and 2) our off-policy data has good coverage over all features (under a strong spectral condition), then any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon in order to non-trivially estimate the value of any given policy. Our results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).

Speaker bio: Ruosong Wang is a fourth year Ph.D. student at Carnegie Mellon University, advised by Prof. Ruslan Salakhutdinov. He received his B.Eng. from IIIS, Tsinghua University. He has broad interest in the theory and applications of modern machine learning, and his recent research interests include theoretical foundations for reinforcement learning and deep learning.

Tuesday 2nd March 2021, 6pm UTC

Speaker: Liyu Chen (University of Southern California)

Title: Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition

Paper: https://arxiv.org/abs/2012.04053

Slides: here

More details:

Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition

Authors: Liyu Chen, Haipeng Luo, Chen-Yu Wei

Abstract: We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is O˜(\sqrt{DT^⋆K}) and O˜(\sqrt{DT^⋆SAK}) for the full-information setting and the bandit feedback setting respectively, where D is the diameter, T^⋆ is the expected hitting time of the optimal policy, S is the number of states, A is the number of actions, and K is the number of episodes. Our results significantly improve upon the existing work of (Rosenberg and Mansour, 2020) which only considers the full-information setting and achieves suboptimal regret. Our work is also the first to consider bandit feedback with adversarial costs.

Our algorithms are built on top of the Online Mirror Descent framework with a variety of new techniques that might be of independent interest, including an improved multi-scale expert algorithm, a reduction from general stochastic shortest path to a special loop-free case, a skewed occupancy measure space, and a novel correction term added to the cost estimators. Interestingly, the last two elements reduce the variance of the learner via positive bias and the variance of the optimal policy via negative bias respectively, and having them simultaneously is critical for obtaining the optimal high-probability bound in the bandit feedback setting.

Speaker bio: Liyu is a fourth year PhD student at the University of Southern California. He is advised by Prof. Haipeng Luo. Liyu's research interests are in online learning and the theory of reinforcement learning. Previously he completed his bachelor's degree at the Hong Kong University of Science and Technology and wrote his final year thesis with Prof. Dit-Yan Yeung.

Tuesday 23rd February 2021, 6pm UTC

Speaker: Qinghua Liu (Princeton)

Title: Sample-Efficient Reinforcement Learning of Undercomplete POMDPs

Paper: https://arxiv.org/abs/2006.12484

Slides: here

More details:

Sample-Efficient Reinforcement Learning of Undercomplete POMDPs

Authors: Chi Jin, Sham M. Kakade, Akshay Krishnamurthy, Qinghua Liu

Abstract: Partial observability is a common challenge in many reinforcement learning applications, which requires an agent to maintain memory, infer latent states, and integrate this past information into exploration. This challenge leads to a number of computational and statistical hardness results for learning general Partially Observable Markov Decision Processes (POMDPs). This work shows that these hardness barriers do not preclude efficient reinforcement learning for rich and interesting subclasses of POMDPs. In particular, we present a sample-efficient algorithm, OOM-UCB, for episodic finite undercomplete POMDPs, where the number of observations is larger than the number of latent states and where exploration is essential for learning, thus distinguishing our results from prior works. OOM-UCB achieves an optimal sample complexity of O(1/\epsilon^2) for finding an ε-optimal policy, along with being polynomial in all other relevant quantities. As an interesting special case, we also provide a computationally and statistically efficient algorithm for POMDPs with deterministic state transitions.
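
Here "undercomplete" is the regime stated in the abstract, with more observations than latent states; the quantitative version commonly used in this line of work asks that the emission matrix \mathbb{O} \in \mathbb{R}^{O \times S}, whose columns are the observation distributions of the latent states, be well conditioned:

O \ge S \qquad \text{and} \qquad \sigma_S(\mathbb{O}) \ge \alpha > 0,

so that the observations are informative enough about the latent state.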

Speaker bio: Qinghua is a third-year PhD student at Princeton University advised by Chi Jin. He is interested in reinforcement learning with function approximation, multi-agents or partial information.

Tuesday 16th February 2021, 6pm UTC

Speaker: Andrea Zanette (Stanford)

Title: Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL

Paper: https://arxiv.org/abs/2012.08005

Slides: here

More details:

Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL

Authors: Andrea Zanette

Abstract: Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration. Often these applications require us to 1) identify a near optimal policy or to 2) estimate the value of a target policy. For both tasks we derive exponential information-theoretic lower bounds in discounted infinite horizon MDPs with a linear function representation for the action value function, even if 1) realizability holds, 2) the batch algorithm observes the exact reward and transition functions, and 3) the batch algorithm is given the best a priori data distribution for the problem class. Furthermore, if the dataset does not come from policy rollouts then the lower bounds hold even if the action-value function of every policy admits a linear representation.

If the objective is to find a near-optimal policy, we discover that these hard instances are easily solved by an online algorithm, showing that there exist RL problems where batch RL is exponentially harder than online RL, even under the most favorable batch data distribution. In other words, online exploration is critical to enable sample efficient RL with function approximation. A second corollary is the exponential separation between finite and infinite horizon batch problems under our assumptions. On a technical level, this work helps formalize the issue known as the deadly triad and explains that the bootstrapping problem (Sutton and Barto, 2018) is potentially more severe than the extrapolation issue for RL because, unlike the latter, bootstrapping cannot be mitigated by adding more samples.

Speaker bio: Andrea is a PhD candidate in the Institute for Computational and Mathematical Engineering at Stanford University, advised by Prof. Emma Brunskill and Prof. Mykel J. Kochenderfer. He also works closely with Alessandro Lazaric from Facebook Artificial Intelligence Research. Andrea's research focuses on provably efficient methods for Reinforcement Learning; in particular, he develops agents capable of autonomous exploration. His research has been partially supported by Total. Before starting his PhD, Andrea was a master's student in the same department, and his bachelor's degree is in Mechanical Engineering from the University of Padova.