Speaker: Dylan Foster (MIT)
Title: Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective
Paper: https://arxiv.org/abs/2010.03104
Slides: here
More details:
Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective
Authors: Dylan J. Foster, Alexander Rakhlin, David Simchi-Levi, Yunzong Xu
Abstract: In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm. Are similar guarantees possible for contextual bandits? While positive results are known for certain special cases, there is no general theory characterizing when and how instance-dependent regret bounds for contextual bandits can be achieved for rich, general classes of policies. We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds. We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case. Finally, we provide structural results that tie together a number of complexity measures previously proposed throughout contextual bandits, reinforcement learning, and active learning and elucidate their role in determining the optimal instance-dependent regret. In a large-scale empirical evaluation, we find that our approach often gives superior results for challenging exploration problems.
Turning our focus to reinforcement learning with function approximation, we develop new oracle-efficient algorithms for reinforcement learning with rich observations that obtain optimal gap-dependent sample complexity.
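As background for the kind of guarantee the abstract refers to, the classical K-armed bandit contrast can be stated as follows (standard bounds, included only for orientation and not taken from the paper):

```latex
% Classical K-armed bandit with suboptimality gaps Delta_a = mu^* - mu_a:
% instance-dependent vs. minimax regret (standard results, e.g. for UCB-type algorithms).
\[
  \mathrm{Regret}(T) \;\lesssim\; \sum_{a : \Delta_a > 0} \frac{\log T}{\Delta_a}
  \quad \text{(instance-dependent)},
  \qquad
  \mathrm{Regret}(T) \;\lesssim\; \sqrt{KT}
  \quad \text{(minimax, up to log factors)}.
\]
```

The paper asks when and how analogous gap-dependent improvements can be achieved in contextual bandits with rich, general policy classes.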
Speaker bio: Dylan is a postdoctoral researcher at the MIT Institute for Foundations of Data Science. Previously (08/2014-01/2019) he was a PhD student in the Department of Computer Science at Cornell University, advised by Karthik Sridharan; before that, he graduated with a BS and MS in electrical engineering from the University of Southern California in 2014. His research focuses on learning for decision making, including online learning/sequential prediction, contextual bandits, reinforcement learning, and control. Dylan is interested in uncovering new algorithmic principles and fundamental limits for these settings, with themes including sample efficiency, computational efficiency, and adaptivity. More broadly, he is interested in all aspects of generalization, sample complexity, and related algorithmic problems in modern/real-world settings, including non-convex optimization/learning and deep learning.
Speaker: Simon S Du (University of Washington)
Title: Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon
Paper: https://arxiv.org/abs/2009.13503
Slides: here
More details:
Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon
Authors: Zihan Zhang, Xiangyang Ji, Simon S. Du
Abstract: Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult due to the long planning horizon and unknown state-dependent transitions. The current paper shows that the long planning horizon and the unknown state-dependent transitions pose at most little additional difficulty in terms of sample complexity. We consider episodic reinforcement learning with S states, A actions, planning horizon H, total reward bounded by 1, and an agent that plays for K episodes. We propose a new algorithm, Monotonic Value Propagation (MVP), which relies on a new Bernstein-type bonus. The new bonus only requires tweaking the constants to ensure optimism and is thus significantly simpler than existing bonus constructions. We show that MVP enjoys an O((√(SAK)+S²A)·polylog(SAHK)) regret, approaching the Ω(√(SAK)) lower bound of contextual bandits. Notably, this result 1) exponentially improves the state-of-the-art polynomial-time algorithms by Dann et al. [2019], Zanette et al. [2019], and Zhang et al. [2020] in terms of the dependency on H, and 2) exponentially improves the running time in [Wang et al. 2020] and significantly improves the dependency on S, A, and K in sample complexity.
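To make the idea of a Bernstein-type bonus concrete, here is a minimal sketch of the general shape such a bonus takes: a variance-dependent term plus a lower-order 1/n term. The constants c1, c2 and the variance proxy are placeholders for illustration, not the calibrated quantities used in MVP.

```python
import numpy as np

def bernstein_bonus(values, n, total_range=1.0, delta=0.1, c1=2.0, c2=7.0):
    """Bernstein-style exploration bonus from n sampled next-state values.

    A generic sketch: sqrt(variance * log(1/delta) / n) plus a lower-order
    range * log(1/delta) / n term.  The constants c1, c2 are placeholders,
    not the calibrated constants of the MVP algorithm.
    """
    n = max(n, 1)
    var = np.var(values) if len(values) > 0 else total_range ** 2
    log_term = np.log(1.0 / delta)
    return c1 * np.sqrt(var * log_term / n) + c2 * total_range * log_term / n

# Example: bonus for a state-action pair visited 50 times (hypothetical samples).
sampled_next_values = np.random.rand(50) * 0.2
print(bernstein_bonus(sampled_next_values, n=50))
```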
Speaker bio: Simon is an assistant professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington. His research interests are broadly in machine learning, including deep learning, representation learning, and reinforcement learning. Previously, Simon was a postdoc at the Institute for Advanced Study in Princeton, hosted by Sanjeev Arora. He completed his Ph.D. in Machine Learning at Carnegie Mellon University, where he was co-advised by Aarti Singh and Barnabás Póczos. Simon has also spent time at the Simons Institute and at the research labs of Facebook, Google, and Microsoft.
Speaker: Xinyi Chen (Google / Princeton)
Title: Black-Box Control for Linear Dynamical Systems
Paper: https://arxiv.org/abs/2007.06650
Slides: here
More details:
Black-Box Control for Linear Dynamical Systems
Authors: Xinyi Chen, Elad Hazan
Abstract: We consider the problem of controlling an unknown linear time-invariant dynamical system from a single chain of black-box interactions, with no access to resets or offline simulation. Under the assumption that the system is controllable, we give the first efficient algorithm capable of attaining sublinear regret in a single trajectory in the setting of online nonstochastic control. We give a finite-time regret bound for our algorithm, as well as a nearly matching lower bound showing that this regret is essentially the best attainable by any algorithm.
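For orientation on the setting (an unknown LTI system observed along a single trajectory, with no resets), below is a minimal sketch of ordinary least-squares identification of (A, B) from one rollout under random inputs. This is only a standard warm-up illustration, not the paper's black-box control algorithm; A_true and B_true exist here only to simulate data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown ground-truth LTI system x_{t+1} = A x_t + B u_t + w_t (for simulation only).
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])

T, dx, du = 500, 2, 1
X = np.zeros((T + 1, dx))
U = rng.normal(size=(T, du))          # exploratory (random) inputs
for t in range(T):
    w = 0.01 * rng.normal(size=dx)    # process noise
    X[t + 1] = A_true @ X[t] + B_true @ U[t] + w

# Least-squares estimate of [A B] from the single trajectory.
Z = np.hstack([X[:-1], U])            # regressors (x_t, u_t)
Theta, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)
A_hat, B_hat = Theta.T[:, :dx], Theta.T[:, dx:]
print(np.round(A_hat, 2), np.round(B_hat, 2))
```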
Speaker bio: Xinyi is a Ph.D. student in the Department of Computer Science at Princeton University advised by Prof. Elad Hazan. Xinyi is interested in algorithms for machine learning, specifically online learning and nonconvex optimization.
Speaker: Alekh Agarwal (Microsoft Research)
Title: PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning
Paper: https://arxiv.org/abs/2007.08459
Slides: here
More details:
PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning
Authors: Alekh Agarwal, Mikael Henaff, Sham Kakade, Wen Sun
Abstract: Direct policy gradient methods for reinforcement learning are a successful approach for a variety of reasons: they are model-free, they directly optimize the performance metric of interest, and they allow for richly parameterized policies. Their primary drawback is that, by being local in nature, they fail to adequately explore the environment. In contrast, while model-based approaches and Q-learning directly handle exploration through the use of optimism, their ability to handle model misspecification and function approximation is far less evident. This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and running time for both tabular MDPs and, more generally, linear MDPs in an infinite-dimensional RKHS. Furthermore, PC-PG also has strong guarantees under model misspecification that go beyond the standard worst-case ℓ_∞ assumptions; this includes approximation guarantees for state aggregation under an average-case error assumption, along with guarantees under a more general assumption where the approximation error under distribution shift is controlled. We complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
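To illustrate the policy-cover idea in the simplest (tabular) case, the sketch below pools state-action visitation counts from rollouts of an ensemble of policies and converts them into a count-based exploration bonus. The function and its inputs are schematic illustrations of the idea rather than the PC-PG algorithm itself.

```python
import numpy as np

def cover_bonus(cover_rollouts, n_states, n_actions, scale=1.0):
    """Count-based bonus from rollouts of a policy cover (tabular sketch).

    cover_rollouts: list of trajectories, each a list of (state, action) pairs,
    collected by rolling out every policy currently in the cover.
    Returns bonus[s, a] = scale / sqrt(N[s, a] + 1), which is large for
    state-actions the current cover rarely reaches.
    """
    counts = np.zeros((n_states, n_actions))
    for traj in cover_rollouts:
        for s, a in traj:
            counts[s, a] += 1
    return scale / np.sqrt(counts + 1.0)

# Example with hypothetical rollouts from two policies in the cover.
rollouts = [[(0, 1), (2, 0), (3, 1)], [(0, 0), (1, 0), (1, 1)]]
print(cover_bonus(rollouts, n_states=4, n_actions=2))
```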
Speaker bio: Alekh is currently a researcher at Microsoft Research AI, where he leads the Reinforcement Learning research group. Previously, he spent six years in the New York lab of Microsoft Research after obtaining his PhD in Computer Science from UC Berkeley, working with Peter Bartlett and Martin Wainwright. Alekh is broadly interested in machine learning, statistics, and optimization. He is currently working on several aspects of interactive machine learning, including contextual bandits, reinforcement learning, and active learning, with an eye towards practical learning systems with strong theoretical guarantees. He has previously worked on tradeoffs between computational and statistical complexities, large-scale and distributed machine learning, and statistical inference in high dimensions.
Speaker: Thodoris Lykouris (Microsoft Research)
Title: Corruption robust exploration in episodic reinforcement learning
Paper: https://arxiv.org/abs/1911.08689
Slides: here
More details:
Corruption robust exploration in episodic reinforcement learning
Authors: Thodoris Lykouris, Max Simchowitz, Aleksandrs Slivkins, Wen Sun
Abstract: We initiate the study of multi-stage episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system, extending recent results for the special case of stochastic bandits. We provide a framework that modifies the aggressive exploration enjoyed by existing reinforcement learning approaches based on "optimism in the face of uncertainty" by complementing them with principles from "action elimination". Importantly, our framework circumvents the major challenges posed by naively applying action elimination in the RL setting, as formalized by a lower bound we demonstrate. Our framework yields efficient algorithms which (a) attain near-optimal regret in the absence of corruptions and (b) adapt to unknown levels of corruption, enjoying regret guarantees that degrade gracefully in the total corruption encountered. To showcase the generality of our approach, we derive results for both tabular settings (where states and actions are finite) as well as linear-function-approximation settings (where the dynamics and rewards admit a linear underlying representation). Notably, our work provides the first sublinear regret guarantee that accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning.
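The "action elimination" principle the framework borrows from stochastic bandits can be illustrated by classical successive elimination for a K-armed bandit. This is a textbook routine shown only as background; it is not the corruption-robust RL algorithm, and the pull interface and constants are illustrative.

```python
import numpy as np

def successive_elimination(pull, n_arms, n_rounds, delta=0.05):
    """Classical successive elimination for a stochastic K-armed bandit.

    pull(a) returns a reward in [0, 1].  An arm is eliminated once its upper
    confidence bound falls below the best lower confidence bound.
    """
    active = list(range(n_arms))
    means = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    for _ in range(n_rounds):
        for a in list(active):
            r = pull(a)
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
        rad = np.sqrt(np.log(2 * n_arms * n_rounds / delta) / (2 * counts[active]))
        lcb_best = np.max(means[active] - rad)
        active = [a for a, ucb in zip(active, means[active] + rad) if ucb >= lcb_best]
        if len(active) == 1:
            break
    return active

# Example with hypothetical Bernoulli arms.
rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.7]
print(successive_elimination(lambda a: rng.binomial(1, true_means[a]), 3, 2000))
```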
Speaker bio: Thodoris is a postdoctoral researcher at Microsoft Research NYC in the Machine Learning group. He obtained his Ph.D. in the Computer Science department of Cornell University where he was advised by Éva Tardos. Prior to that, he received a Diploma in EECS from National Technical University of Athens. Thodoris is broadly interested in machine learning, algorithms, optimization, and economics. In particular, his research focuses on multiple aspects of data-driven sequential decision making.
Speaker: Marc Abeille (Criteo)
Title: Efficient Optimistic Exploration in Linear-Quadratic Regulators via Lagrangian Relaxation
Paper: https://arxiv.org/abs/2007.06482
Slides: here
More details:
Efficient Optimistic Exploration in Linear-Quadratic Regulators via Lagrangian Relaxation
Authors: Marc Abeille, Alessandro Lazaric
Abstract: We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting. Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of OFU-LQ and cast it into a constrained extended LQR problem, where an additional control variable implicitly selects the system dynamics within a confidence interval. We then move to the corresponding Lagrangian formulation for which we prove strong duality. As a result, we show that an ϵ-optimistic controller can be computed efficiently by solving at most O(log(1/ϵ)) Riccati equations. Finally, we prove that relaxing the original OFU problem does not impact the learning performance, thus recovering the O(√T) regret of OFU-LQ. To the best of our knowledge, this is the first computationally efficient confidence-based algorithm for LQR with worst-case optimal regret guarantees.
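The computational primitive in question, solving a discrete-time Riccati equation to obtain an LQR gain for a candidate system, can be sketched with SciPy as below; the Lagrangian binary search over candidate dynamics that yields the ϵ-optimistic controller is specific to the paper and not shown here.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Infinite-horizon discrete-time LQR gain via the discrete algebraic
    Riccati equation; u_t = -K x_t is the optimal controller for (A, B, Q, R)."""
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

# Example on a hypothetical 2-state, 1-input system.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
K, P = lqr_gain(A, B, Q, R)
print(np.round(K, 3))
```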
Speaker bio: Marc is a researcher at the Criteo AI Lab in Paris. Prior to that, he was at INRIA - Team SequeL, where he completed a PhD under the supervision of Alessandro Lazaric and Remi Munos. His recent work has focused on model-based reinforcement learning and parametric bandits, with a specific focus on linear systems and their generalizations. Marc is particularly interested in designing adaptive algorithms with theoretical performance guarantees that can be used in real-world systems. He is also interested in continuous-time optimal control theory and its potential benefit for discrete-time reinforcement learning problems.
Speaker: Pierre Ménard (Inria Lille)
Title: Adaptive Reward-Free Exploration
Paper: https://arxiv.org/abs/2006.06294
Slides: here
More details:
Adaptive Reward-Free Exploration
Authors: Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, Michal Valko
Abstract: Reward-free exploration is a reinforcement learning setting recently studied by Jin et al., who address it by running several algorithms with regret guarantees in parallel. In our work, we instead propose a more adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs O((SAH^4/ε^2) ln(1/δ)) episodes to output, with probability 1−δ, an ε-approximation of the optimal policy for any reward function. We empirically compare it to oracle strategies using a generative model.
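As a rough illustration of the reward-free planning step, the sketch below runs finite-horizon value iteration with a simple uncertainty proxy 1/√(N(s,a)+1) as the only reward, producing a policy that seeks under-visited state-actions. RF-UCRL propagates sharper upper bounds on the estimation error, so this is a simplified stand-in rather than the algorithm itself; all inputs are hypothetical.

```python
import numpy as np

def exploration_policy(P_hat, counts, H):
    """Plan to visit poorly-explored state-actions (reward-free spirit).

    P_hat:  empirical transition tensor, shape (S, A, S).
    counts: visit counts N(s, a), shape (S, A).
    Runs finite-horizon value iteration with the proxy 1/sqrt(N(s,a)+1) as
    the only reward and returns the greedy policy pi[h, s].
    """
    S, A = counts.shape
    bonus = 1.0 / np.sqrt(counts + 1.0)
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = bonus + P_hat @ V            # (S, A): bonus + expected future uncertainty
        pi[h] = np.argmax(Q, axis=1)
        V = np.max(Q, axis=1)
    return pi

# Example with a hypothetical 3-state, 2-action empirical model.
S, A, H = 3, 2, 5
P_hat = np.full((S, A, S), 1.0 / S)
counts = np.array([[10, 0], [3, 3], [0, 1]], dtype=float)
print(exploration_policy(P_hat, counts, H))
```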
Speaker bio: Pierre Ménard is a post-doctoral researcher in the Sequel team, at Inria Lille. He works with Emilie Kaufmann and Michal Valko on bandit theory and theoretical reinforcement learning. Previously, he obtained his PhD from the Université de Toulouse III - Paul Sabatier under the supervision of Aurélien Garivier and Gilles Stoltz.
Speaker: Yu-Xiang Wang (UC Santa Barbara)
Title: Near Optimal Provable Uniform Convergence in Off-Policy Evaluation for Reinforcement Learning
Paper: https://arxiv.org/abs/2007.03760
Slides: here
More details:
Near Optimal Provable Uniform Convergence in Off-Policy Evaluation for Reinforcement Learning
Authors: Ming Yin, Yu Bai, Yu-Xiang Wang
Abstract: Off-policy evaluation (OPE) aims at estimating the performance of a target policy π using offline data rolled in by a logging policy μ. Intensive studies have been conducted, and recent marginalized importance sampling (MIS) methods achieve sample-efficient OPE. However, little is known about whether uniform convergence guarantees in OPE can be obtained efficiently. In this paper, we consider this new question and reveal the comprehensive relationship between OPE and offline learning for the first time. For the global policy class, by using the fully model-based OPE estimator, our best result achieves ϵ-uniform convergence with complexity O˜(H^3·min(S,H)/(d_m ϵ^2)), where d_m is an instance-dependent quantity determined by μ. This result is only one factor away from our uniform convergence lower bound, up to a logarithmic factor. For the local policy class, ϵ-uniform convergence is achieved with the optimal complexity O˜(H^3/(d_m ϵ^2)) in the off-policy setting. This result complements the work on sparse model-based planning (Agarwal et al. 2019) with a generative model. Lastly, an interesting corollary of our intermediate result implies a refined analysis of the simulation lemma.
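A minimal tabular sketch of the model-based OPE idea discussed in the abstract: build empirical transitions and rewards from the logged episodes, then evaluate the target policy by backward induction. The shapes, inputs, and deterministic-policy assumption are simplifications for illustration, not the paper's exact estimator or analysis.

```python
import numpy as np

def model_based_ope(episodes, pi, S, A, H):
    """Model-based off-policy evaluation in a tabular finite-horizon MDP.

    episodes: list of trajectories [(s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1})]
              logged under a behavior policy mu.
    pi:       target (deterministic) policy, array of shape (H, S).
    Builds empirical transitions/rewards and evaluates pi by backward induction.
    """
    P = np.zeros((H, S, A, S))
    R = np.zeros((H, S, A))
    N = np.zeros((H, S, A))
    for traj in episodes:
        for h, (s, a, r) in enumerate(traj):
            N[h, s, a] += 1
            R[h, s, a] += r
            if h + 1 < H:
                s_next = traj[h + 1][0]
                P[h, s, a, s_next] += 1
    R /= np.maximum(N, 1)
    P /= np.maximum(N, 1)[..., None]

    V = np.zeros(S)
    for h in reversed(range(H)):
        a = pi[h]                                   # action taken by pi in each state
        V = R[h, np.arange(S), a] + P[h, np.arange(S), a] @ V
    return V  # estimated value of pi at each initial state

# Example with two hypothetical logged episodes (S=2, A=2, H=2).
eps = [[(0, 1, 1.0), (1, 0, 0.0)], [(0, 0, 0.0), (0, 1, 1.0)]]
pi = np.array([[1, 1], [1, 1]])
print(model_based_ope(eps, pi, S=2, A=2, H=2))
```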
Speaker bio: Yu-Xiang Wang is a faculty member of the Computer Science Department at UCSB. Prior to joining UCSB, he was a scientist with Amazon AI in Palo Alto. Before that, he was with the Machine Learning Department at Carnegie Mellon University and had the pleasure of being jointly advised by Stephen Fienberg, Alex Smola, Ryan Tibshirani, and Jing Lei. Over the years he has worked on a diverse set of problems in the broad area of statistical machine learning, e.g., trend filtering, differential privacy, subspace clustering, large-scale learning/optimization, and bandits/reinforcement learning, to name a few. Recently he has worked on making differential privacy practical and on developing a statistical foundation for off-policy reinforcement learning.
Speaker: Zhuoran Yang (Princeton)
Title: Provably Efficient Exploration in Policy Optimization
Paper: https://arxiv.org/abs/1912.05830
Slides: here
More details:
Provably Efficient Exploration in Policy Optimization
Authors: Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang
Abstract: While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves O(√(d²H³T)) regret. Here d is the feature dimension, H is the episode horizon, and T is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
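Schematically, the "optimistic version of the policy gradient direction" is a KL-regularized (mirror-descent) policy improvement step driven by an optimistic action-value estimate; the notation below is ours and the details are in the paper.

```latex
% Schematic OPPO-style policy improvement step (episode k, step h):
% an exponential-weights / mirror-descent update driven by an optimistic
% Q-estimate built from regression plus an exploration bonus.
\[
  \pi_h^{k+1}(a \mid s) \;\propto\; \pi_h^{k}(a \mid s)\,
  \exp\!\big( \alpha\, Q_h^{k}(s, a) \big),
\]
where $\alpha > 0$ is the step size and $Q_h^{k}$ upper-bounds the true
action-value function with high probability (optimism).
```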
Speaker bio: Zhuoran is a Ph.D. candidate in the Department of Operations Research and Financial Engineering at Princeton University advised by Professor Jianqing Fan and Professor Han Liu. Prior to attending Princeton, he obtained a Bachelor of Mathematics degree from Tsinghua University. Zhuoran's research interests lie in the interface between machine learning, statistics and optimization. The primary goal of his research is to design efficient learning algorithms for large-scale decision making problems that arise in reinforcement learning and stochastic games, with both statistical and computational guarantees. In addition, he is also interested in the applications of reinforcement learning such as computer games and robotics.
Speaker: Chen-Yu Wei (USC)
Title: Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
Paper: https://arxiv.org/abs/2007.11849
Slides: here
More details:
Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
Authors: Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Rahul Jain
Abstract: We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal O˜(√T) regret and another computationally efficient variant with O˜(T^3/4) regret, where T is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with O˜(√T) regret under a different set of assumptions, improving upon the best existing result of Hao et al. (2020), which has O˜(T^2/3) regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given by Agarwal et al. (2020).
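For context on the connection the abstract draws, Kakade's Natural Policy Gradient update preconditions the vanilla policy gradient with the (pseudo-)inverse Fisher information of the policy parameterization; this is the standard form of NPG, not anything specific to this paper.

```latex
% Natural Policy Gradient (Kakade, 2002): precondition the policy gradient
% with the (pseudo-)inverse Fisher information matrix of pi_theta.
\[
  \theta_{t+1} \;=\; \theta_t \;+\; \eta\, F(\theta_t)^{\dagger}\, \nabla_\theta J(\theta_t),
  \qquad
  F(\theta) \;=\; \mathbb{E}_{s,a \sim \pi_\theta}\!\big[
    \nabla_\theta \log \pi_\theta(a \mid s)\,
    \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \big].
\]
```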
Speaker bio: Chen-Yu Wei is a fourth-year Computer Science PhD student at the University of Southern California, advised by Haipeng Luo. They received their M.S. in Communication Engineering and B.S. in Electrical Engineering from National Taiwan University. Their research mainly focuses on designing provable algorithms for online decision making, reinforcement learning, and learning in multi-player games. They are also interested in designing representation learning and unsupervised learning algorithms with information-theoretic justifications.
Speaker: Ruosong Wang (CMU)
Title: Provably Efficient Reinforcement Learning with General Value Function Approximation
Paper: https://arxiv.org/abs/2005.10804
Slides: here
More details:
Provably Efficient Reinforcement Learning with General Value Function Approximation
Authors: Ruosong Wang, Ruslan Salakhutdinov, Lin F. Yang
Abstract: Value function approximation has demonstrated phenomenal empirical success in reinforcement learning (RL). Nevertheless, despite recent progress on developing theory for RL with linear function approximation, the understanding of general function approximation schemes remains largely missing. In this paper, we establish a provably efficient RL algorithm with general value function approximation. We show that if the value functions admit an approximation with a function class F, our algorithm achieves a regret bound of O(poly(dH)·√T), where d is a complexity measure of F that depends on the eluder dimension [Russo and Van Roy, 2013] and log-covering numbers, H is the planning horizon, and T is the number of interactions with the environment. Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment. Moreover, our algorithm is model-free and provides a framework to justify the effectiveness of algorithms used in practice.
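The role of the eluder dimension can be pictured through a confidence set of functions that fit the observed data well: the algorithm acts optimistically within that set, and the eluder dimension controls how often the set can remain wide. Below is a minimal finite-class sketch of the optimistic estimate and the "width"; it is a schematic illustration under that framing, not the paper's algorithm, and the class F and data are hypothetical.

```python
import numpy as np

def confidence_set(F, data, beta):
    """Functions in the finite class F whose squared error on the data is
    within beta of the best fit.  F: list of callables f(x); data: (x, y) pairs."""
    losses = [sum((f(x) - y) ** 2 for x, y in data) for f in F]
    best = min(losses)
    return [f for f, loss in zip(F, losses) if loss <= best + beta]

def optimistic_value_and_width(F, data, beta, x):
    """Optimistic estimate max_f f(x) and width max_f f(x) - min_f f(x)
    over the confidence set -- the pointwise uncertainty whose cumulative
    size eluder-dimension arguments bound."""
    Fset = confidence_set(F, data, beta)
    vals = [f(x) for f in Fset]
    return max(vals), max(vals) - min(vals)

# Example with a hypothetical class of three scalar functions.
F = [lambda x: 0.5 * x, lambda x: 0.8 * x, lambda x: 1.0 - 0.1 * x]
data = [(1.0, 0.55), (2.0, 1.0)]
print(optimistic_value_and_width(F, data, beta=0.1, x=3.0))
```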
Speaker bio: Ruosong Wang is a fourth year Ph.D. student at Carnegie Mellon University, advised by Prof. Ruslan Salakhutdinov. He received his B.Eng. from IIIS, Tsinghua University. He has broad interest in the theory and applications of modern machine learning, and his recent research interests include theoretical foundations for reinforcement learning and deep learning.