Speaker: Jean Tarbouriech (Google DeepMind)
Title: Probabilistic Inference in Reinforcement Learning Done Right
Paper: https://arxiv.org/abs/2311.13294
Slides: here
More Details:
Probabilistic Inference in Reinforcement Learning Done Right
Authors: Jean Tarbouriech, Tor Lattimore, Brendan O'Donoghue
Abstract: A popular perspective in reinforcement learning (RL) casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP). The core object of study is the probability of each state-action pair being visited under the optimal policy. Previous approaches to approximating this quantity can be arbitrarily poor, leading to algorithms that do not implement genuine statistical inference and consequently do not perform well in challenging problems. In this work, we undertake a rigorous Bayesian treatment of the posterior probability of state-action optimality and clarify how it flows through the MDP. We first reveal that this quantity can indeed be used to generate a policy that explores efficiently, as measured by regret. Unfortunately, computing it is intractable, so we derive a new variational Bayesian approximation yielding a tractable convex optimization problem and establish that the resulting policy also explores efficiently. We call our approach VAPOR and show that it has strong connections to Thompson sampling, K-learning, and maximum entropy exploration. We conclude with some experiments demonstrating the performance advantage of a deep RL version of VAPOR.
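To make the core object of this abstract concrete, here is a toy, bandit-level illustration (not the VAPOR algorithm itself): a Monte Carlo estimate of the posterior probability that each action is optimal under a Beta-Bernoulli model with hypothetical counts. In the full MDP setting, this posterior over state-action optimality is exactly the quantity the paper shows is intractable to compute and approximates variationally.

```python
# Toy illustration only: Monte Carlo estimate of the posterior probability that
# each action of a Bernoulli bandit is optimal, under a Beta(1, 1) prior.
# In the MDP setting, the analogous posterior over state-action optimality is
# what the paper shows to be intractable and approximates variationally.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts of observed successes/failures for three actions.
successes = np.array([8, 5, 2])
failures = np.array([2, 5, 8])

n_samples = 100_000
# Draw mean-reward vectors from the product Beta posterior.
theta = rng.beta(successes + 1, failures + 1, size=(n_samples, len(successes)))
# Fraction of posterior draws in which each action is the best one.
p_optimal = np.bincount(theta.argmax(axis=1), minlength=len(successes)) / n_samples
print("P(action is optimal | data) ≈", p_optimal)
```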
Speaker Bio: Jean Tarbouriech is a research scientist at Google DeepMind London currently working on large language models. He received his PhD in Reinforcement Learning from Inria Lille and Meta AI Paris.
Speaker: Adrienne Tuynman (INRIA)
Title: Finding good policies in average-reward Markov Decision Processes without prior knowledge
Paper: https://arxiv.org/abs/2405.17108
Slides: here
More Details:
Finding good policies in average-reward Markov Decision Processes without prior knowledge
Authors: Adrienne Tuynman, Rémy Degenne, Emilie Kaufmann
Abstract: We revisit the identification of an ε-optimal policy in average-reward Markov Decision Processes (MDPs). In such MDPs, two measures of complexity have appeared in the literature: the diameter, D, and the optimal bias span, H, which satisfy H≤D. Prior work has studied the complexity of ε-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with D≃H for which the sample complexity to output an ε-optimal policy is Ω(SAD/ε^2), where S and A are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order SAH/ε^2 has been proposed, but it requires knowledge of H. We first show that the sample complexity required to estimate H is not bounded by any function of S, A and H, ruling out the possibility of easily making the previous algorithm agnostic to H. By relying instead on a diameter estimation procedure, we propose the first algorithm for (ε,δ)-PAC policy identification that does not need any form of prior knowledge on the MDP. Its sample complexity scales in SAD/ε^2 in the regime of small ε, which is near-optimal. In the online setting, our first contribution is a lower bound which implies that a sample complexity polynomial in H cannot be achieved in this setting. Then, we propose an online algorithm with a sample complexity in SAD^2/ε^2, as well as a novel approach based on a data-dependent stopping rule that we believe is promising to further reduce this bound.
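Since the guarantees above are stated in terms of the diameter D, the following sketch spells out what D is for a known tabular MDP (this is only a definition-level illustration, not the paper's sample-based diameter estimation procedure): the worst case over ordered state pairs of the minimal expected hitting time, computed by value iteration on a stochastic shortest path problem.

```python
# Illustrative sketch: the diameter D of a known tabular MDP, i.e. the worst case
# over ordered state pairs of the minimal expected time to reach one state from
# another. Hitting times are computed by value iteration on a stochastic
# shortest path problem with the target state made absorbing at cost zero.
import numpy as np

def diameter(P, n_iters=100_000, tol=1e-10):
    """P has shape (S, A, S); P[s, a] is the next-state distribution."""
    S = P.shape[0]
    D = 0.0
    for target in range(S):
        h = np.zeros(S)  # minimal expected hitting times to `target`
        for _ in range(n_iters):
            h_new = 1.0 + (P @ h).min(axis=1)  # one step plus best continuation
            h_new[target] = 0.0
            done = np.max(np.abs(h_new - h)) < tol
            h = h_new
            if done:
                break
        D = max(D, h.max())
    return D

# Made-up two-state MDP: one sticky action and one switching action per state.
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.5, 0.5], [0.1, 0.9]]])
print("diameter ≈", diameter(P))  # expected: 2.0 (the fastest switch succeeds w.p. 1/2)
```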
Speaker Bio: Adrienne Tuynman is a second year PhD student in the Inria Scool team in Lille, where she works with Emilie Kaufmann and Rémy Degenne. Before that, she completed her master's degree in the Paris Saclay MVA program. She is particularly interested in identification problems, in Multi-Armed Bandits and Reinforcement Learning.
Speaker: Victor Boone (Univ. of Grenoble)
Title: Achieving Tractable Minimax Optimal Regret in Average Reward MDPs
Paper: here
Slides: here
More Details:
Achieving Tractable Minimax Optimal Regret in Average Reward MDPs
Authors: Victor Boone, Zihan Zhang
Abstract: In this talk, we will overview the current state of the art on regret minimization in average-reward Markov decision processes in the minimax setting. More specifically, we will take a deep dive into model-based optimistic methods and emphasize the importance of Bellman operators. We will explain why previous methods fail to achieve the minimax regret lower bound and what they were lacking to achieve it. In fact, the presented method, "PMEVI", is a universal patch that makes most existing model-based algorithms minimax optimal. Last but not least, we will discuss the limitations of our method, how these limitations can be interpreted, and what comes next.
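To ground the Bellman operators the talk centers on, here is a minimal sketch of classical average-reward value iteration with a span-based stopping rule on a known two-state MDP. It is not PMEVI itself, which additionally performs optimism over a confidence region and a bias-projection step; the example MDP is made up.

```python
# Minimal sketch of plain average-reward value iteration with a span-based
# stopping rule on a *known* MDP, to make the Bellman optimality operator
# concrete. PMEVI itself additionally uses optimism over a confidence region
# and a projection step, which are not shown here.
import numpy as np

def avg_reward_value_iteration(P, r, eps=1e-8, max_iters=100_000):
    """P: (S, A, S) transition kernel, r: (S, A) rewards. Returns (gain, bias, policy)."""
    S = P.shape[0]
    h = np.zeros(S)
    for _ in range(max_iters):
        q = r + P @ h              # Bellman optimality operator, shape (S, A)
        h_new = q.max(axis=1)
        diff = h_new - h
        span = diff.max() - diff.min()
        h = h_new - h_new.min()    # re-center to keep iterates bounded
        if span < eps:
            break
    gain = 0.5 * (diff.max() + diff.min())
    return gain, h, q.argmax(axis=1)

# Made-up two-state MDP: action 0 moves to state 0 (reward 0.5),
# action 1 moves to state 1 (reward 1 when taken from state 1).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.5, 0.0],
              [0.5, 1.0]])
gain, bias, policy = avg_reward_value_iteration(P, r)
print("gain ≈", gain, "policy:", policy)  # expected: gain ≈ 1.0, policy [1 1]
```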
Speaker Bio: Victor Boone is a postdoc at Inria Grenoble. He recently defended (Nov 2024) his PhD thesis, "Optimal regrets in Markov decision processes", under the supervision of Bruno Gaujal & Panayotis Mertikopoulos at the Université Grenoble Alpes.
Speaker: Itai Shufaro (Technion)
Title: On Bits and Bandits: Quantifying the Regret-Information Trade-off
Paper: here
Slides: here
More Details:
On Bits and Bandits: Quantifying the Regret-Information Trade-off
Authors: Itai Shufaro, Nadav Merlis, Nir Weinberger, Shie Mannor
Abstract: In many sequential decision problems, an agent performs a repeated task. It then suffers regret and obtains information that it may use in the following rounds. However, sometimes the agent may also obtain information and avoid suffering regret by querying external sources. We study the trade-off between the information an agent accumulates and the regret it suffers. We invoke information-theoretic methods for obtaining regret lower bounds, which also allow us to easily re-derive several known lower bounds. We introduce the first Bayesian regret lower bounds that depend on the information an agent accumulates. We also prove regret upper bounds using the amount of information the agent accumulates. These bounds show that information, measured in bits, can be traded off for regret, measured in reward. Finally, we demonstrate the utility of these bounds in improving the performance of a question-answering task with large language models, allowing us to obtain valuable insights.
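As a toy illustration of the trade-off (not the paper's construction), the snippet below runs Thompson sampling on a Bernoulli bandit and reports both the cumulative regret and the information gained, measured as the drop, in bits, of the posterior entropy over which arm is optimal. The arm means and horizon are arbitrary choices.

```python
# Toy illustration of the regret-information trade-off: run Thompson sampling on
# a Bernoulli bandit and report cumulative regret alongside the bits of
# information gained about the identity of the optimal arm.
import numpy as np

rng = np.random.default_rng(1)
means = np.array([0.3, 0.5, 0.7])            # hypothetical arm means
K, T = len(means), 2000
alpha, beta = np.ones(K), np.ones(K)         # Beta(1, 1) priors

def optimality_entropy_bits(alpha, beta, n=20_000):
    """Entropy (in bits) of the posterior over which arm is optimal."""
    samples = rng.beta(alpha, beta, size=(n, K))
    p = np.bincount(samples.argmax(axis=1), minlength=K) / n
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

regret = 0.0
for t in range(T):
    arm = rng.beta(alpha, beta).argmax()     # Thompson sampling draw
    reward = rng.random() < means[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward
    regret += means.max() - means[arm]

print(f"cumulative regret ≈ {regret:.1f}")
print(f"bits of information gained ≈ {np.log2(K) - optimality_entropy_bits(alpha, beta):.2f}")
```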
Speaker Bio: Itai Shufaro is an M.Sc. student in Electrical Engineering at the Technion, working under the supervision of Shie Mannor. He is a recipient of the Meyer Award. Before that, he obtained his bachelor’s degree in Electrical Engineering and Physics from the Technion. His research interests lie primarily in statistics and reinforcement learning.
Speaker: Jongmin Lee (Seoul National University)
Title: Accelerating Value Iteration via Anchoring
Paper: here
Slides: here
More Details:
Accelerating Value Iteration via Anchoring
Authors: Jongmin Lee and Ernest Ryu
Abstract: Value Iteration (VI) is foundational to the theory and practice of modern reinforcement learning, and it is known to converge at a (γ^k)-rate, where γ is the discount factor. Surprisingly, however, the optimal rate for the VI setup was not known, and finding a general acceleration mechanism has been an open problem. In this paper, we present the first accelerated VI for both the Bellman consistency and optimality operators. Our method, called Anc-VI, is based on an anchoring mechanism (distinct from Nesterov's acceleration), and it reduces the Bellman error faster than standard VI. In particular, Anc-VI exhibits an O(1/k)-rate for γ≈1 or even γ=1, while standard VI has rate O(1) for γ≥1−1/k, where k is the iteration count. We also provide a complexity lower bound matching the upper bound up to a constant factor of 4, thereby establishing optimality of the accelerated rate of Anc-VI. Finally, we show that the anchoring mechanism provides the same benefit in the approximate VI and Gauss-Seidel VI setups as well.
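The following sketch shows the anchoring mechanism next to standard VI, applied to the Bellman consistency (policy-evaluation) operator of a slowly mixing cyclic chain with γ close to 1. The schedule β_k = 1/(k+1) is the classical Halpern choice, used here purely for illustration rather than the paper's γ-dependent coefficients, and the deliberately poor initialization is an illustrative choice that makes the distance to the fixed point dominate.

```python
# Sketch of the anchoring mechanism behind Anc-VI versus standard value
# iteration, applied to the Bellman consistency (policy-evaluation) operator
# T(V) = r + gamma * P @ V of a slowly mixing cyclic chain with gamma near 1.
# Anchored update: V_k = beta_k * V_0 + (1 - beta_k) * T(V_{k-1}); the schedule
# beta_k = 1 / (k + 1) is the classical Halpern choice, used purely for
# illustration (the paper derives its own gamma-dependent coefficients).
import numpy as np

def run(P, r, gamma, V0, iters, anchored):
    T = lambda V: r + gamma * (P @ V)   # Bellman consistency operator
    V = V0.copy()
    for k in range(1, iters + 1):
        if anchored:
            beta = 1.0 / (k + 1)
            V = beta * V0 + (1.0 - beta) * T(V)
        else:
            V = T(V)
    return np.abs(T(V) - V).max()       # sup-norm Bellman error after `iters` steps

rng = np.random.default_rng(0)
S, gamma, iters = 20, 0.999, 200
P = np.roll(np.eye(S), 1, axis=1)       # deterministic cycle: state s moves to s + 1
r = rng.random(S)
V0 = rng.normal(scale=1.0 / (1.0 - gamma), size=S)  # deliberately poor initial guess

print("Bellman error, standard VI:", run(P, r, gamma, V0, iters, anchored=False))
print("Bellman error, anchored VI:", run(P, r, gamma, V0, iters, anchored=True))
```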
Speaker Bio: Jongmin Lee is a PhD student at Seoul National University, supervised by Ernest K. Ryu. His research focuses on reinforcement learning and optimization theory, with a particular interest in applying classical optimization algorithms and techniques to reinforcement learning frameworks.
Speaker: Dhruv Rohatgi (MIT)
Title: Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning
Paper: here
Slides: here
More Details:
Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning
Authors: Noah Golowich, Ankur Moitra, Dhruv Rohatgi
Abstract: Supervised learning is often computationally easy in practice. But to what extent does this mean that other modes of learning, such as reinforcement learning (RL), ought to be computationally easy by extension? In this work we show the first cryptographic separation between RL and supervised learning, by exhibiting a class of block MDPs and associated decoding functions where reward-free exploration is provably computationally harder than the associated regression problem. We also show that there is no computationally efficient algorithm for reward-directed RL in block MDPs, even when given access to an oracle for this regression problem.
It is known that being able to perform regression in block MDPs is necessary for finding a good policy; our results suggest that it is not sufficient. Our separation lower bound uses a new robustness property of the Learning Parities with Noise (LPN) hardness assumption, which is crucial in handling the dependent nature of RL data. We argue that separations and oracle lower bounds, such as ours, are a more meaningful way to prove hardness of learning because the constructions better reflect the practical reality that supervised learning by itself is often not the computational bottleneck.
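For readers unfamiliar with the hardness assumption, here is a small sketch of a Learning Parities with Noise (LPN) sample oracle: examples (x, ⟨x, s⟩ ⊕ noise) over GF(2), from which recovering the secret s is believed to be computationally hard at a constant noise rate. This only illustrates the assumption; the block-MDP reduction and the robustness property used in the paper are not shown, and the dimension and noise rate are arbitrary.

```python
# Small sketch of a Learning Parities with Noise (LPN) sample oracle: labels are
# noisy parities of a secret bit-vector, and recovering the secret is believed
# to be computationally hard at a constant noise rate. This illustrates only the
# assumption, not the paper's block-MDP reduction.
import numpy as np

rng = np.random.default_rng(0)

def lpn_samples(secret, n_samples, noise_rate=0.1):
    """Return (X, y) with y = X·secret mod 2, each label flipped w.p. noise_rate."""
    d = len(secret)
    X = rng.integers(0, 2, size=(n_samples, d))
    noise = rng.random(n_samples) < noise_rate
    y = (X @ secret + noise) % 2
    return X, y

secret = rng.integers(0, 2, size=32)
X, y = lpn_samples(secret, n_samples=1000)
print("label agreement with the noiseless parity:", ((X @ secret) % 2 == y).mean())
```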
Speaker Bio: Dhruv Rohatgi is a PhD Student at the Massachusetts Institute of Technology. His research interests are in theoretical machine learning, with particular focus on understanding fundamental statistical problems from a computational perspective. He is fortunate to be advised by Ankur Moitra, and was supported by an NDSEG Graduate Fellowship and an MIT Akamai Fellowship.
Speaker: David Cheikhi (Columbia University)
Title: On the Limited Representational Power of Value Functions and its Links to Statistical (In)Efficiency
Paper: here
Slides: here
More Details:
On the Limited Representational Power of Value Functions and its Links to Statistical (In)Efficiency
Authors: David Cheikhi, Daniel Russo
Abstract: Identifying the trade-offs between model-based and model-free methods is a central question in reinforcement learning. Value-based methods offer substantial computational advantages and are sometimes just as statistically efficient as model-based methods. However, focusing on the core problem of policy evaluation, we show that information about the transition dynamics may be impossible to represent in the space of value functions. We explore this through a series of case studies focused on structures that arise in many important problems. In several of these, there is no information loss and value-based methods are as statistically efficient as model-based ones. In other closely related examples, information loss is severe and value-based methods are severely outperformed. A deeper investigation points to the limitations of representational power as the driver of the inefficiency, as opposed to a failure in algorithm design.
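As a concrete, purely illustrative instance of the comparison, the sketch below evaluates a fixed policy on a small random chain in two ways from the same trajectory: a model-based estimate that fits the transition matrix and solves the Bellman equation, and model-free tabular TD(0). The environment, step size, and trajectory length are arbitrary; the point is the two estimation routes, not a reproduction of the paper's case studies.

```python
# Illustrative comparison for policy evaluation on a small chain: model-based
# estimation (fit the transition matrix, then solve the Bellman equation) versus
# model-free TD(0), both using the same on-policy trajectory.
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 5, 0.9
# Markov reward process induced by some fixed policy: random chain plus rewards.
P = rng.dirichlet(np.ones(S), size=S)      # (S, S) transitions under the policy
r = rng.random(S)
V_true = np.linalg.solve(np.eye(S) - gamma * P, r)

# One long on-policy trajectory.
T = 20_000
states = np.empty(T + 1, dtype=int); states[0] = 0
for t in range(T):
    states[t + 1] = rng.choice(S, p=P[states[t]])
rewards = r[states[:-1]]

# Model-based: empirical transition matrix, then solve the Bellman equation.
counts = np.zeros((S, S))
np.add.at(counts, (states[:-1], states[1:]), 1.0)
P_hat = np.full((S, S), 1.0 / S)
visited = counts.sum(axis=1) > 0
P_hat[visited] = counts[visited] / counts[visited].sum(axis=1, keepdims=True)
V_model = np.linalg.solve(np.eye(S) - gamma * P_hat, r)

# Model-free: tabular TD(0) over the same trajectory.
V_td = np.zeros(S)
for t in range(T):
    s, s_next = states[t], states[t + 1]
    V_td[s] += 0.01 * (rewards[t] + gamma * V_td[s_next] - V_td[s])

print("model-based error:", np.abs(V_model - V_true).max())
print("TD(0) error      :", np.abs(V_td - V_true).max())
```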
Speaker Bio: David Cheikhi is a fourth-year PhD student at Columbia Business School advised by Professor Dan Russo. His research interests are in sequential decision making with a focus on policy evaluation. He received his MS from Columbia University and his BS from Ecole Polytechnique.
Speaker: Zakaria Mhammedi (Google Research)
Title: Sample- and Oracle-Efficient Reinforcement Learning for MDPs with Linearly-Realizable Value Functions
Paper: here
Slides: here
More Details:
Sample- and Oracle-Efficient Reinforcement Learning for MDPs with Linearly-Realizable Value Functions
Authors: Zakaria Mhammedi
Abstract: Designing sample-efficient and computationally feasible reinforcement learning (RL) algorithms is particularly challenging in environments with large or infinite state and action spaces. In this paper, we advance this effort by presenting an efficient algorithm for Markov Decision Processes (MDPs) where the state-action value function of any policy is linear in a given feature map. This challenging setting can model environments with infinite states and actions, strictly generalizes classic linear MDPs, and currently lacks a computationally efficient algorithm under online access to the MDP. Specifically, we introduce a new RL algorithm that efficiently finds a near-optimal policy in this setting, using a number of episodes and calls to a cost-sensitive classification (CSC) oracle that are both polynomial in the problem parameters. Notably, our CSC oracle can be efficiently implemented when the feature dimension is constant, representing a clear improvement over state-of-the-art methods, which require solving non-convex problems with horizon-many variables and can incur computational costs that are exponential in the horizon.
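To make the oracle interface concrete, here is a minimal sketch of a cost-sensitive classification (CSC) oracle: given contexts and a cost for every action in every context, it returns a policy from some class that approximately minimizes total cost. The regression-based implementation and the synthetic data are illustrative stand-ins, not the oracle construction from the paper.

```python
# Minimal sketch of the cost-sensitive classification (CSC) oracle interface:
# given contexts and per-action costs, return a policy that approximately
# minimizes total cost. The regression reduction over a linear class below is an
# illustrative stand-in, not the paper's oracle implementation.
import numpy as np

def csc_oracle(features, costs):
    """features: (n, d) contexts; costs: (n, A) cost of each action per context.
    Fits one linear cost predictor per action and returns the greedy policy."""
    _, A = costs.shape
    W = np.column_stack([np.linalg.lstsq(features, costs[:, a], rcond=None)[0]
                         for a in range(A)])          # (d, A) predicted-cost weights
    return lambda x: int(np.argmin(x @ W))            # policy: context -> action

# Hypothetical synthetic data: costs are noisy linear functions of the features.
rng = np.random.default_rng(0)
n, d, A = 500, 3, 4
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, A))
C = X @ W_true + 0.1 * rng.normal(size=(n, A))
policy = csc_oracle(X, C)
chosen = np.array([policy(x) for x in X])
print("avg cost of learned policy:", C[np.arange(n), chosen].mean())
print("avg cost of best action per context:", C.min(axis=1).mean())
```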
Speaker Bio: Zak Mhammedi is a Research Scientist at Google Research, focusing on reinforcement learning and optimization. He completed his PhD in Computer Science at the Australian National University and previously held a postdoctoral position at MIT. Zak’s work bridges the gap between theoretical and practical AI, particularly in developing efficient reinforcement learning algorithms. He has presented at top conferences such as COLT, NeurIPS, and ICML, with several papers receiving oral and spotlight recognition.
Speaker: Vlad Tkachuk (University of Alberta)
Title: Trajectory Data Suffices for Statistically Efficient Learning in Offline RL with Linear q^π-Realizability and Concentrability
Paper: https://arxiv.org/abs/2405.16809
Slides: here
More Details:
Authors: Vlad Tkachuk, Gellért Weisz, Csaba Szepesvári
Abstract: We consider offline reinforcement learning (RL) in H-horizon Markov decision processes (MDPs) under the linear q^π-realizability assumption, where the action-value function of every policy is linear with respect to a given d-dimensional feature function. The hope in this setting is that learning a good policy will be possible without requiring a sample size that scales with the number of states in the MDP. Foster et al. [2021] have shown this to be impossible even under concentrability, a data coverage assumption where a coefficient C_conc bounds the extent to which the state-action distribution of any policy can veer off the data distribution. However, the data in this previous work was in the form of a sequence of individual transitions. This leaves open the question of whether the negative result mentioned could be overcome if the data was composed of sequences of full trajectories. In this work we answer this question positively by proving that with trajectory data, a dataset of size poly(d, H, C_conc)/ε^2 is sufficient for deriving an ε-optimal policy, regardless of the size of the state space. The main tool that makes this result possible is due to Weisz et al. [2023], who demonstrate that linear MDPs can be used to approximate linearly q^π-realizable MDPs. The connection to trajectory data is that the linear MDP approximation relies on “skipping” over certain states. The associated estimation problems are thus easy when working with trajectory data, while they remain nontrivial when working with individual transitions. The question of computational efficiency under our assumptions remains open.
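As a toy illustration of the linear q^π-realizability assumption itself (not the paper's algorithm), the sketch below rolls out full trajectories under one fixed policy in a tiny finite-horizon MDP and regresses Monte Carlo returns onto a feature map, recovering θ with q^π(s,a) ≈ φ(s,a)ᵀθ at the first stage. One-hot features make the assumption hold trivially; all quantities here are made up.

```python
# Toy illustration of linear q^pi-realizability: regress Monte Carlo returns of
# full trajectories under one fixed policy onto a feature map. One-hot features
# on a tiny MDP make the assumption hold trivially; the paper's algorithm (built
# on the skipped-state linear-MDP approximation of Weisz et al.) is far more
# involved and handles every policy simultaneously.
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 2, 5
P = rng.dirichlet(np.ones(S), size=(S, A))     # (S, A, S) transition kernel
r = rng.random((S, A))
pi = rng.integers(0, A, size=S)                # a fixed deterministic policy
phi = np.eye(S * A)                            # one-hot features, d = S * A

# Roll out full trajectories under pi; record (phi of first state-action, return).
X, y = [], []
for _ in range(3000):
    s = rng.integers(S)
    ret, feats = 0.0, phi[s * A + pi[s]]
    for h in range(H):
        a = pi[s]
        ret += r[s, a]
        s = rng.choice(S, p=P[s, a])
    X.append(feats); y.append(ret)

theta = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)[0]
# With an H-horizon MDP the value is stage-dependent; this estimates the stage-1 value.
print("estimated q^pi at stage 1 for (s=0, a=pi[0]):", phi[0 * A + pi[0]] @ theta)
```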
Speaker Bio: Vlad Tkachuk is a second-year PhD student in Computing Science at the University of Alberta, working under the supervision of Csaba Szepesvári and Xiaoqi Tan. He completed his master’s degree at the University of Alberta, also under the supervision of Csaba Szepesvári. Before that, he obtained his bachelor’s in Electrical Engineering from the University of Waterloo. His research interests lie primarily in reinforcement learning theory.