Speaker: Zakaria Mhammedi (Google Research)
Title: The Power of Resets in Online Reinforcement Learning
Paper: https://arxiv.org/abs/2404.15417
Slides: here
More Details:
The Power of Resets in Online Reinforcement Learning
Authors: Zakaria Mhammedi, Dylan J. Foster, Alexander Rakhlin
Abstract: Simulators are a pervasive tool in reinforcement learning, but most existing algorithms cannot efficiently exploit simulator access—particularly in high-dimensional domains that require general function approximation. We explore the power of simulators through online reinforcement learning with local simulator access (or, local planning), an RL protocol where the agent is allowed to reset to previously observed states and follow their dynamics during training. We use local simulator access to unlock new statistical guarantees that were previously out of reach:
We show that MDPs with low coverability (Xie et al., 2023)—a general structural condition that subsumes Block MDPs and Low-Rank MDPs—can be learned in a sample-efficient fashion with only Q⋆-realizability (realizability of the optimal state-action value function); existing online RL algorithms require significantly stronger representation conditions.
As a consequence, we show that the notorious Exogenous Block MDP problem (Efroni et al., 2022b) is tractable under local simulator access.
The results above are achieved through a computationally inefficient algorithm. We complement them with a more computationally efficient algorithm, RVFS (Recursive Value Function Search), which achieves provable sample complexity guarantees under a strengthened statistical assumption known as pushforward coverability. RVFS can be viewed as a principled, provable counterpart to a successful empirical paradigm that combines recursive search (e.g., MCTS) with value function approximation.
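To make the local-simulator protocol above concrete, the sketch below shows one way the reset-to-visited-states access model could be wrapped around an ordinary MDP interface. The class and method names (LocalSimulator, reset_to, rollout_from) and the assumed mdp.reset()/mdp.step() interface are illustrative, not taken from the paper.

```python
# A minimal sketch of online RL with local simulator access: the learner may
# reset the environment to any previously observed state and continue rolling
# out from there. All names here are illustrative, not from the paper.
class LocalSimulator:
    """Wraps an MDP so the learner can reset to states it has already visited."""

    def __init__(self, mdp):
        self.mdp = mdp          # assumed to expose reset() and step(state, action)
        self.visited = []       # states observed so far; the legal reset targets

    def reset(self):
        state = self.mdp.reset()
        self.visited.append(state)
        return state

    def reset_to(self, state):
        # Local access: only previously observed states may serve as reset points.
        assert state in self.visited, "can only reset to a previously visited state"
        return state

    def step(self, state, action):
        next_state, reward, done = self.mdp.step(state, action)
        self.visited.append(next_state)
        return next_state, reward, done


def rollout_from(sim, state, policy, horizon):
    """Roll out `policy` for up to `horizon` steps starting from a revisited state."""
    state = sim.reset_to(state)
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward, done = sim.step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward
```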
Speaker Bio: Zak Mhammedi is a Research Scientist at Google Research, focusing on reinforcement learning and optimization. He completed his PhD in Computer Science at the Australian National University and, before joining Google, held a postdoctoral position at MIT. Zak’s work bridges the gap between theoretical and practical AI, particularly in developing efficient reinforcement learning algorithms. He has presented at top conferences such as COLT, NeurIPS, and ICML, with several papers receiving oral and spotlight recognition.
Speaker: Audrey Huang (UIUC)
Title: Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization
Paper: https://arxiv.org/abs/2407.13399
Slides: here
More Details:
Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization
Authors: Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D. Lee, Wen Sun, Akshay Krishnamurthy, Dylan J. Foster
Abstract: KL-regularization is widely employed in offline language modeling alignment algorithms, such as Direct Preference Optimization (DPO), yet these methods are limited by a pervasive phenomenon called overoptimization. Over the course of the alignment process, the quality of the language model degrades as a consequence of overfitting to an inaccurate reward function that is (either explicitly or implicitly) learned from offline data. Since out-of-distribution generalization is a statistical phenomenon, this raises a fundamental question: Are existing KL-regularized offline alignment algorithms doomed to overfit, and can we design a method that provably prevents overoptimization?
We address this problem with a new algorithm for offline alignment, Chi-squared Preference Optimization (ChiPO). ChiPO is a one-line change to Direct Preference Optimization (DPO), and modifies only the logarithmic link function in the DPO objective. Despite this minimal change, ChiPO implicitly implements the principle of “pessimism in the face of uncertainty” via regularization with the chi-squared divergence, which quantifies uncertainty more effectively than KL-regularization. We show that ChiPO provably alleviates overoptimization and achieves sample-complexity guarantees based on single-policy concentrability, the gold standard in offline reinforcement learning. In contrast, KL-regularized algorithms such as DPO may require exponentially more samples. ChiPO’s simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm that is provably robust to overoptimization.
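To illustrate what a one-line change to the DPO link function can look like, here is a rough sketch that swaps DPO's logarithmic link for a mixed link phi(z) = log z + z, one natural way to encode chi-squared regularization. The precise objective in the paper (including any clipping of the ratios) may differ, and the function names below are illustrative.

```python
# A rough sketch contrasting the DPO loss with a chi-squared-style variant that
# changes only the link function applied to the policy/reference probability
# ratio. Here phi(z) = log z + z adds a linear term to the KL (log) link; treat
# the exact form, and the absence of clipping, as assumptions for illustration.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO: logistic loss on beta times the log-ratio margin."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

def chi_squared_po_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Same logistic loss, but the log link is replaced by phi(z) = log z + z."""
    def phi(log_ratio):
        # log z + z, evaluated with z = pi(y|x) / pi_ref(y|x); large ratios may
        # warrant clipping in practice.
        return log_ratio + torch.exp(log_ratio)
    margin = phi(logp_w - ref_logp_w) - phi(logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```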
Speaker Bio: Audrey Huang is a fourth-year PhD candidate in Computer Science at the University of Illinois Urbana-Champaign, advised by Nan Jiang. Currently, she is excited about combining ideas from RL theory with the capabilities and structures of language modeling to design efficient and provable algorithms. She also works on the complexity of online/offline RL with function approximation, as well as on imitation learning. Audrey has interned at Microsoft Research, Google, and Adobe, and received her MS from Carnegie Mellon University and BS from Caltech.
Speaker: Noah Golowich (MIT)
Title: Linear Bellman Completeness Suffices for Efficient Online Reinforcement Learning with Few Actions
Paper: https://arxiv.org/abs/2406.11640
Slides: here
More Details:
Linear Bellman Completeness Suffices for Efficient Online Reinforcement Learning with Few Actions
Authors: Noah Golowich, Ankur Moitra
Abstract: One of the most natural approaches to reinforcement learning (RL) with function approximation is value iteration, which inductively generates approximations to the optimal value function by solving a sequence of regression problems. To ensure the success of value iteration, it is typically assumed that Bellman completeness holds, which ensures that these regression problems are well-specified. We study the problem of learning an optimal policy under Bellman completeness in the online model of RL with linear function approximation. In the linear setting, while statistically efficient algorithms are known under Bellman completeness (e.g., Jiang et al. (2017); Zanette et al. (2020)), these algorithms all rely on the principle of global optimism, which requires solving a nonconvex optimization problem. In particular, it has remained open whether computationally efficient algorithms exist. In this paper, we give the first polynomial-time algorithm for RL under linear Bellman completeness when the number of actions is any constant.
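To see why value iteration reduces to a sequence of regression problems, here is a minimal sketch of fitted value iteration with linear function approximation; under Bellman completeness each backup target is (approximately) realizable by the linear class, so each regression is well-specified. The data layout and feature map phi are assumptions for illustration, not the paper's algorithm.

```python
# A minimal sketch of value iteration with linear function approximation: at
# each step h (from H-1 down to 0) we regress Bellman backup targets onto
# features, which is well-specified when Bellman completeness holds.
import numpy as np

def fitted_value_iteration(data, phi, H, d, num_actions, reg=1e-3):
    """data[h] is a list of (s, a, r, s_next); phi(s, a) returns a d-dim feature."""
    theta = [np.zeros(d) for _ in range(H + 1)]   # theta[H] stays 0 (terminal)
    for h in reversed(range(H)):
        X, y = [], []
        for (s, a, r, s_next) in data[h]:
            # Bellman backup target: r + max_a' Q_{h+1}(s', a').
            q_next = max(phi(s_next, ap) @ theta[h + 1] for ap in range(num_actions))
            X.append(phi(s, a))
            y.append(r + q_next)
        X, y = np.array(X), np.array(y)
        # Ridge-regularized least squares gives the step-h value-function weights.
        theta[h] = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
    return theta
```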
Speaker Bio: Noah Golowich is a PhD student at the Massachusetts Institute of Technology, advised by Constantinos Daskalakis and Ankur Moitra. His research interests are in theoretical machine learning, with a particular focus on the connections between multi-agent learning, game theory, and online learning, and on the theory of reinforcement learning. He is supported by a Fannie & John Hertz Foundation Fellowship and an NSF Graduate Fellowship, and was previously supported by an MIT Akamai Fellowship.
Speaker: Runzhe Wu (Cornell University)
Title: Computationally Efficient RL under Linear Bellman Completeness for Deterministic Dynamics
Paper: https://arxiv.org/abs/2406.11810
Slides: here
More Details:
Computationally Efficient RL under Linear Bellman Completeness for Deterministic Dynamics
Authors: Runzhe Wu, Ayush Sekhari, Akshay Krishnamurthy, Wen Sun
Abstract: We study computationally and statistically efficient reinforcement learning (RL) algorithms for the linear Bellman complete setting, a setting that uses linear function approximation to capture value functions and unifies existing models like linear Markov Decision Processes (MDPs) and Linear Quadratic Regulators (LQR). While it is known from prior work that this setting is statistically tractable, it remained open whether a computationally efficient algorithm exists. Our work provides a computationally efficient algorithm for the linear Bellman complete setting that works for MDPs with large action spaces, random initial states, and random rewards, but relies on the underlying dynamics being deterministic. Our approach is based on randomization: we inject random noise into least-squares regression problems to perform optimistic value iteration. Our key technical contribution is to carefully design the noise so that it acts only in the null space of the training data, ensuring optimism while circumventing a subtle error amplification issue.
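The null-space randomization idea can be illustrated with a short sketch: after solving the least-squares problem, Gaussian noise is added only along directions orthogonal to the span of the observed features, so in-sample predictions are unchanged while unexplored directions receive random (potentially optimistic) perturbations. The noise scale and regularization below are placeholder choices; the paper's construction calibrates them more carefully.

```python
# A small illustration of noise confined to the null space of the training
# data: perturbing the least-squares solution only in directions orthogonal to
# the span of observed features leaves in-sample predictions untouched while
# randomizing predictions on unexplored directions.
import numpy as np

def null_space_perturbed_regression(X, y, noise_scale=1.0, reg=1e-6, rng=None):
    """X: (n, d) feature matrix; y: (n,) regression targets."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    # Ordinary (ridge-regularized) least-squares solution.
    theta = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
    # SVD of the data: right singular vectors beyond the rank span the null
    # space of the training features.
    _, s, Vt = np.linalg.svd(X, full_matrices=True)
    rank = int(np.sum(s > 1e-10))
    null_basis = Vt[rank:]                      # shape (d - rank, d)
    if null_basis.shape[0] > 0:
        # Gaussian noise supported only on the null space: X @ noise is ~0,
        # so predictions on the training data are (numerically) unchanged.
        noise = null_basis.T @ rng.standard_normal(null_basis.shape[0]) * noise_scale
        theta = theta + noise
    return theta
```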
Speaker Bio: Runzhe Wu is a third-year PhD student in Computer Science at Cornell University, working under the supervision of Wen Sun. His research centers on reinforcement learning, with a focus on developing principled and efficient algorithms. Currently, he is developing provably efficient algorithms inspired by generative models, such as diffusion models, to improve performance in RL and imitation learning tasks. He received his bachelor’s degree from the ACM Honors Class of Shanghai Jiao Tong University.
Speaker: Asaf Cassel (Tel Aviv University)
Title: Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes
Paper: https://arxiv.org/abs/2407.03065
Slides: here
More Details:
Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes
Authors: Asaf Cassel, Aviv Rosenberg
Abstract: Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model. However, their algorithm relies on a costly pure-exploration warm-up phase that is hard to implement in practice. This paper eliminates that warm-up phase, replacing it with a simple and efficient contraction mechanism. Our PO algorithm achieves rate-optimal regret with improved dependence on the other parameters of the problem (horizon and function approximation dimension) in two fundamental settings: adversarial losses with full-information feedback and stochastic losses with bandit feedback.
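For readers unfamiliar with the PO template these results refine, a generic sketch of the core step is shown below: estimate Q-values for the current policy and apply a softmax (exponential-weights / mirror-descent) update per state. This is only the standard PO update; the paper's contraction mechanism and exploration bonuses are not reproduced here.

```python
# A generic sketch of the policy-optimization step that PO-style algorithms
# iterate: push the policy logits in the direction of estimated Q-values and
# renormalize per state. This is not the paper's specific algorithm.
import numpy as np

def softmax_po_update(policy_logits, q_estimates, eta=0.1):
    """policy_logits, q_estimates: arrays of shape (H, S, A)."""
    new_logits = policy_logits + eta * q_estimates      # exponential-weights step
    # Subtracting the per-state max leaves the policy unchanged but keeps the
    # logits numerically bounded.
    new_logits = new_logits - new_logits.max(axis=-1, keepdims=True)
    probs = np.exp(new_logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return new_logits, probs
```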
Speaker Bio: Asaf Cassel is a fifth-year PhD student in Computer Science at Tel Aviv University, working under the supervision of Tomer Koren. Prior to that, he completed his MSc, under the supervision of Shie Mannor, and his BSc, both in the EE department at the Technion. Asaf’s research interests lie primarily in reinforcement learning theory, with an emphasis on non-tabular problem settings.
Speaker: Yaqi Duan (NYU)
Title: Taming "data-hungry" reinforcement learning? Stability in continuous state-action spaces
Paper: https://arxiv.org/abs/2401.05233
Slides: here
More Details:
Taming "data-hungry" reinforcement learning? Stability in continuous state-action spaces
Authors: Yaqi Duan, Martin J. Wainwright
Abstract: We introduce a novel framework for analyzing reinforcement learning (RL) in continuous state-action spaces, and use it to prove fast rates of convergence in both off-line and on-line settings. Our analysis highlights two key stability properties, relating to how changes in value functions and/or policies affect the Bellman operator and occupation measures. We argue that these properties are satisfied in many continuous state-action Markov decision processes, and demonstrate how they arise naturally when using linear function approximation methods. Our analysis offers fresh perspectives on the roles of pessimism and optimism in off-line and on-line RL, and highlights the connection between off-line RL and transfer learning.
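As a reference for the objects the stability properties concern, the (discounted) Bellman operator and occupation measure can be written as follows; these are standard definitions, and the paper's specific stability conditions on how they respond to perturbations of f and π are not reproduced here.

```latex
% Standard definitions of the objects the stability properties refer to.
\[
  (\mathcal{T}^{\pi} f)(s, a) \;=\; r(s, a) \;+\; \gamma\,
    \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[ f\big(s', \pi(s')\big) \big],
\]
\[
  \mu^{\pi}(B) \;=\; (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t}\,
    \Pr^{\pi}\big[ (s_t, a_t) \in B \big],
\]
% Here \mathcal{T}^{\pi} is the Bellman operator for policy \pi and \mu^{\pi}
% is the discounted state-action occupation measure; the stability properties
% control how these objects move when f or \pi is perturbed.
```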
Speaker Bio: Yaqi Duan is an Assistant Professor in the Department of Technology, Operations, and Statistics at New York University’s Stern School of Business. Duan’s research interests lie at the intersection of statistics, machine learning, and operations research, with a focus on statistical methodologies and theories that address challenges in data-driven decision-making problems. Her work in reinforcement learning was honored with the 2023 IMS Lawrence D. Brown Ph.D. Student Award. Duan earned her Ph.D. from Princeton University and a B.S. in Mathematics from Peking University. Prior to joining NYU Stern, she was a postdoctoral researcher at MIT, working with Professor Martin J. Wainwright.