Speaker: Qinghua Liu (Princeton)
Title: Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL
Paper: https://arxiv.org/abs/2305.11032
Slides: here
More details:
Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL
Authors: Qinghua Liu, Gellért Weisz, András György, Chi Jin, Csaba Szepesvári
Abstract: While policy optimization algorithms have played an important role in recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited -- they are either restricted to tabular MDPs or suffer from highly suboptimal sample complexity, especially in online RL where exploration is necessary. This paper proposes a simple efficient policy optimization framework -- Optimistic NPG -- for online RL. Optimistic NPG can be viewed as simply combining the classic natural policy gradient (NPG) algorithm [Kakade, 2001] with optimistic policy evaluation subroutines to encourage exploration. For d-dimensional linear MDPs, Optimistic NPG is computationally efficient, and learns an ε-optimal policy within Õ(d^2/ε^3) samples, which is the first computationally efficient algorithm whose sample complexity has the optimal dimension dependence Θ̃(d^2). It also improves over state-of-the-art results of policy optimization algorithms [Zanette et al., 2021] by a factor of d. For general function approximation that subsumes linear MDPs, Optimistic NPG is, to the best of our knowledge, also the first policy optimization algorithm that achieves polynomial sample complexity for learning near-optimal policies.
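To make the update concrete, here is a minimal tabular sketch of the kind of step Optimistic NPG takes: a multiplicative-weights (softmax NPG) policy update against optimistic Q-estimates. The bonus term, toy dimensions, and function names are illustrative placeholders, not the paper's linear-MDP evaluation subroutine.

```python
import numpy as np

def npg_softmax_update(policy, q_optimistic, eta):
    """One natural-policy-gradient step: multiplicative weights on optimistic Q."""
    # policy, q_optimistic: arrays of shape (num_states, num_actions)
    new_policy = policy * np.exp(eta * q_optimistic)
    return new_policy / new_policy.sum(axis=1, keepdims=True)

num_states, num_actions, eta = 4, 3, 0.1
policy = np.full((num_states, num_actions), 1.0 / num_actions)   # uniform initialization
q_hat = np.random.rand(num_states, num_actions)                  # estimated Q-values
bonus = 0.05 * np.ones_like(q_hat)                               # optimism bonus (placeholder)
policy = npg_softmax_update(policy, q_hat + bonus, eta)
```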
Speaker Bio: Qinghua is a final-year PhD student in the ECE department at Princeton University, advised by Chi Jin. He also works closely with Csaba Szepesvári and Yu Bai. Previously, he received a BE degree in electrical engineering and a BS degree in mathematics from Tsinghua University in 2018. His research has been focused on the mathematical foundations of reinforcement learning, covering a diverse range of topics within the field. His notable contributions include (i) co-designing the first statistically efficient algorithm for learning observable POMDPs; (ii) co-proposing the V-learning algorithm, the first algorithm that breaks the curse of multi-agency in MARL; and (iii) co-designing the optimistic MLE and the GOLF algorithm for RL with general function approximation.
Speaker: Zihan Zhang (Princeton)
Title: Settling the Sample Complexity of Online Reinforcement Learning
Paper: https://arxiv.org/abs/2307.13586
Slides: here
More details:
Settling the Sample Complexity of Online Reinforcement Learning
Authors: Zihan Zhang, Yuxin Chen, Jason D. Lee, Simon S. Du
Abstract: A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a "large-sample" regime, imposing an enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory.
We settle this problem for the context of finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed by Zhang et al. (2020), achieves a regret on the order of (modulo log factors)
min{sqrt(SAH^3K), HK},
where S is the number of states, A is the number of actions, H is the planning horizon, and K is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample sizes K ≥ 1, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield ε-accuracy) of (SAH^3)/ε^2 up to log factors, which is minimax-optimal for the full ε-range.
Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in the development of a new regret decomposition strategy and a novel analysis paradigm to decouple complicated statistical dependency -- a long-standing challenge facing the analysis of online RL in the sample-hungry regime.
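For intuition about the algorithmic template, the sketch below runs a generic optimistic, model-based value iteration with count-based bonuses in a finite-horizon tabular MDP. This is only the broad family that MVP belongs to; the paper's monotonic, Bernstein-style bonuses and the burn-in-free analysis are substantially more refined, and the bonus constant here is a placeholder.

```python
import numpy as np

def optimistic_value_iteration(P_hat, R_hat, counts, H, c=1.0):
    """P_hat: (S, A, S) empirical transitions; R_hat, counts: (S, A) arrays."""
    S, A, _ = P_hat.shape
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        bonus = c * H * np.sqrt(1.0 / np.maximum(counts, 1))   # placeholder Hoeffding-style bonus
        # Optimistic Bellman backup, clipped at the maximum remaining return (rewards in [0, 1])
        Q[h] = np.minimum(R_hat + bonus + P_hat @ V[h + 1], H - h)
        V[h] = Q[h].max(axis=1)
    return Q, V
```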
Speaker Bio: I am currently a postdoc in the Department of ECE at Princeton University. Before joining Princeton University, I obtained my Ph.D. degree from the Department of Automation, Tsinghua University, in October 2022. I was an undergraduate at the Department of Automation, Tsinghua University, from 2013 to 2017. I am interested in various machine learning problems and mainly study reinforcement learning and online learning.
Speaker: Osama A. Hanna (UCLA)
Title: Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms
Paper: https://arxiv.org/abs/2211.05632
Slides: here
More details:
Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms
Authors: Osama A. Hanna, Lin F. Yang, Christina Fragouli
Abstract: In this paper, we address the stochastic contextual linear bandit problem, where a decision maker is provided a context (a random set of actions drawn from a distribution). The expected reward of each action is specified by the inner product of the action and an unknown parameter. The goal is to design an algorithm that learns to play as close as possible to the unknown optimal policy after a number of action plays. This problem is considered more challenging than the linear bandit problem, which can be viewed as a contextual bandit problem with a fixed context. Surprisingly, in this paper, we show that the stochastic contextual problem can be solved as if it is a linear bandit problem. In particular, we establish a novel reduction framework that converts every stochastic contextual linear bandit instance to a linear bandit instance, when the context distribution is known. When the context distribution is unknown, we establish an algorithm that reduces the stochastic contextual instance to a sequence of linear bandit instances with small misspecifications and achieves nearly the same worst-case regret bound as the algorithm that solves the misspecified linear bandit instances.
As a consequence, our results imply an O(d sqrt(T log T)) high-probability regret bound for contextual linear bandits, making progress toward resolving an open problem posed in (Li et al., 2019) and (Li et al., 2021).
Our reduction framework opens up a new way to approach stochastic contextual linear bandit problems, and enables improved regret bounds in a number of instances including the batch setting, contextual bandits with misspecifications, contextual bandits with sparse unknown parameters, and contextual bandits with adversarial corruption.
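As a rough illustration of the known-distribution reduction, the sketch below estimates, for a candidate parameter theta, the expected feature of the action that is greedy for theta; a linear bandit algorithm can then be run over the induced action set indexed by theta. The Monte Carlo estimation and the sampler interface are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def expected_greedy_action(theta, sample_context, num_samples=1000, rng=None):
    """Estimate E_context[ argmax_{a in context} <a, theta> ] by Monte Carlo."""
    rng = rng if rng is not None else np.random.default_rng(0)
    total = np.zeros_like(theta, dtype=float)
    for _ in range(num_samples):
        context = sample_context(rng)               # array of shape (num_actions, d)
        total += context[np.argmax(context @ theta)]
    return total / num_samples

# Example context sampler: 5 random actions in R^3 per round (purely illustrative).
sample_context = lambda rng: rng.normal(size=(5, 3))
g_theta = expected_greedy_action(np.array([1.0, 0.0, 0.0]), sample_context)
```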
Speaker Bio: Osama A. Hanna is currently pursuing his Ph.D. degree in Electrical and Computer Engineering at the University of California, Los Angeles (UCLA). He received his BS and MS degrees in Electrical Engineering from the Faculty of Engineering, Cairo University, and Nile University in Egypt in 2014 and 2018, respectively. He received the Award of Excellence from Cairo University in 2014, as well as a Master's Fellowship and a graduate Research Assistantship from Nile University for the years 2014-2018. In 2017, he received the Marie Sklodowska-Curie Fellowship, and in 2018, the Electrical and Computer Engineering Department Fellowship from UCLA. His research interests are in online learning theory, information theory, and algorithms.
Speaker: Zakaria Mhammedi (MIT)
Title: Efficient Model-Free Exploration in Low-Rank MDPs
Paper: https://arxiv.org/abs/2307.03997
Slides: here
More details:
Efficient Model-Free Exploration in Low-Rank MDPs
Authors: Zakaria Mhammedi, Adam Block, Dylan J. Foster, Alexander Rakhlin
Abstract: A major challenge in reinforcement learning is to develop practical, sample-efficient algorithms for exploration in high-dimensional domains where generalization and function approximation are required. Low-Rank Markov Decision Processes -- where transition probabilities admit a low-rank factorization based on an unknown feature embedding -- offer a simple, yet expressive framework for RL with function approximation, but existing algorithms are either (1) computationally intractable, or (2) reliant upon restrictive statistical assumptions such as latent variable structure, access to model-based function approximation, or reachability. In this work, we propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs that is both computationally efficient and model-free, allowing for general function approximation and requiring no additional structural assumptions. Our algorithm, VoX, uses the notion of a generalized optimal design for the feature embedding as an efficiently computable basis for exploration, performing efficient optimal design computation by interleaving representation learning and policy optimization. Our analysis -- which is appealingly simple and modular -- carefully combines several techniques, including a new reduction from optimal design computation to policy optimization based on the Frank-Wolfe method, and an improved analysis of a certain minimax representation learning objective found in prior work.
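As background for the optimal-design ingredient, here is a classical D-optimal design computed over a fixed, finite set of feature vectors with Frank-Wolfe (Wynn-type) updates. VoX instead computes a generalized design over policies with learned features, so this toy version only illustrates the basic primitive; the step size and regularizer are conventional defaults, not the paper's choices.

```python
import numpy as np

def d_optimal_design(features, num_iters=200, reg=1e-6):
    """features: (n, d) array; returns a design (probability weights) over the n vectors."""
    n, d = features.shape
    weights = np.full(n, 1.0 / n)
    for t in range(num_iters):
        cov = features.T @ (weights[:, None] * features) + reg * np.eye(d)
        # Gradient of log det(cov) w.r.t. each weight is the leverage score x_i^T cov^{-1} x_i
        scores = np.einsum("ij,jk,ik->i", features, np.linalg.inv(cov), features)
        i = int(np.argmax(scores))          # direction currently least covered
        step = 1.0 / (t + 2)                # standard Frank-Wolfe step size
        weights = (1 - step) * weights
        weights[i] += step
    return weights
```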
Speaker Bio: Zak Mhammedi is a postdoc at MIT with Sasha Rakhlin and will soon join the learning theory team at Google. His current research focuses on designing practical and provably sample-efficient algorithms for Reinforcement Learning. Zak obtained his PhD at the Australian National University, where he did research on core machine learning topics and was advised by Wouter Koolen and Bob Williamson.
Speaker: Yuda Song (CMU)
Title: Using Both Offline and Online Data Can Make RL Efficient
Paper: https://arxiv.org/abs/2210.06718
Slides: here
More details:
Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient
Authors: Yuda Song, Yifei Zhou, Ayush Sekhari, J. Andrew Bagnell, Akshay Krishnamurthy, Wen Sun
Abstract: We consider a hybrid reinforcement learning setting (Hybrid RL), in which an agent has access to an offline dataset and the ability to collect experience via real-world online interaction. The framework mitigates the challenges that arise in both pure offline and online RL settings, allowing for the design of simple and highly effective algorithms, in both theory and practice. We demonstrate these advantages by adapting the classical Q learning/iteration algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q. In our theoretical results, we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset supports a high-quality policy and the environment has bounded bilinear rank. Notably, we require no assumptions on the coverage provided by the initial distribution, in contrast with guarantees for policy gradient/iteration methods. In our experimental results, we show that Hy-Q with neural network function approximation outperforms state-of-the-art online, offline, and hybrid RL baselines on challenging benchmarks, including Montezuma's Revenge.
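A minimal sketch of the hybrid idea, assuming a linear function class: run fitted Q-iteration where every regression uses the union of the offline dataset and the online replay buffer, with new online data gathered by acting greedily with respect to the current Q. The least-squares regressor and helper names are stand-ins for the paper's general function-approximation setting.

```python
import numpy as np

def fitted_q_step(dataset, q_next, feature_fn, gamma=0.99, reg=1e-3):
    """dataset: list of (s, a, r, s') tuples; q_next(s') returns max_a Q(s', a)."""
    X = np.array([feature_fn(s, a) for s, a, _, _ in dataset])
    y = np.array([r + gamma * q_next(s_next) for _, _, r, s_next in dataset])
    d = X.shape[1]
    # Ridge regression toward the Bellman targets
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

# Hybrid RL loop (schematic): at every iteration regress on offline_data + online_buffer,
# then append transitions collected by the greedy policy for the new Q to online_buffer.
```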
Speaker Bio: Yuda Song is a 2nd year PhD student in the Machine Learning Department at CMU, advised by Aarti Singh and Drew Bagnell. His research focuses on efficient settings and algorithms in Reinforcement Learning. Previously, he obtained an MS in Machine Learning from CMU, advised by Kris Kitani, and a BS in Computer Science and Mathematics from UC San Diego, advised by Sicun Gao.
Speaker: Jonathan Lee (Stanford)
Title: Learning in POMDPs is Sample-Efficient with Hindsight Observability
Paper: https://arxiv.org/abs/2301.13857
Slides: here
More details:
Learning in POMDPs is Sample-Efficient with Hindsight Observability
Authors: Jonathan N. Lee, Alekh Agarwal, Christoph Dann, Tong Zhang
Abstract: POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed at some point during the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a Hindsight Observable Markov Decision Process (HOMDP) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.
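To see what hindsight observability buys in the tabular case, the sketch below updates ordinary empirical counts once the latent states of an episode are revealed, exactly as one would in a fully observed MDP; the data structures and names are illustrative, and the paper's algorithms add optimism on top of such estimates.

```python
from collections import defaultdict

transition_counts = defaultdict(int)   # (s, a, s') -> visit count
emission_counts = defaultdict(int)     # (s, o) -> count of observation o from latent state s

def hindsight_update(latent_states, actions, observations):
    """latent_states has length H + 1; actions and observations have length H."""
    for h in range(len(actions)):
        transition_counts[(latent_states[h], actions[h], latent_states[h + 1])] += 1
        emission_counts[(latent_states[h], observations[h])] += 1
```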
Speaker Bio: Jonathan Lee is a Computer Science PhD student at Stanford University advised by Emma Brunskill. He is broadly interested in machine learning and decision-making. Prior to Stanford, he graduated with a BS in Electrical Engineering & Computer Science at UC Berkeley where he worked on robot learning, advised by Ken Goldberg. He has previously spent time at Google as a research intern and a student researcher on the Brain team and the Learning Theory team.
Speaker: Binghui Peng (Columbia)
Title: The complexity of non-stationary reinforcement learning
Paper: https://arxiv.org/abs/2307.06877
Slides: here
More details:
The complexity of non-stationary reinforcement learning
Authors: Christos Papadimitriou, Binghui Peng
Abstract: The problem of continual learning in the domain of reinforcement learning, often called non-stationary reinforcement learning, has been identified as an important challenge to the application of reinforcement learning. We prove a worst-case complexity result, which we believe captures this challenge: Modifying the probabilities or the reward of a single state-action pair in a reinforcement learning problem requires an amount of time almost as large as the number of states in order to keep the value function up to date, unless the strong exponential time hypothesis (SETH) is false; SETH is a widely accepted strengthening of the P ≠ NP conjecture. Recall that the number of states in current applications of reinforcement learning is typically astronomical. In contrast, we show that just adding a new state-action pair is considerably easier to implement.
Speaker Bio: Binghui Peng is a fifth (and final) year PhD student in the computer science department of Columbia University, advised by Christos Papadimitriou and Xi Chen. Previously, he obtained his bachelor's degree from Tsinghua University. He is broadly interested in theoretical computer science and machine learning. His recent interests include the role of memory in learning, theoretical foundations of continual learning, and learning in games.
Speaker: Ching-An Cheng (Microsoft Research)
Title: ARMOR: A Model-based Framework for Improving Arbitrary Baseline Policies with Offline Data
Paper: https://arxiv.org/abs/2211.04538
Slides: here
More details:
ARMOR: A Model-based Framework for Improving Arbitrary Baseline Policies with Offline Data
Authors: Tengyang Xie, Mohak Bhardwaj, Nan Jiang, Ching-An Cheng
Abstract: We propose a new model-based offline RL framework, called Adversarial Models for Offline Reinforcement Learning (ARMOR), which can robustly learn policies to improve upon an arbitrary baseline policy regardless of data coverage. Based on the concept of relative pessimism, ARMOR is designed to optimize for the worst-case relative performance when facing uncertainty. In theory, we prove that the learned policy of ARMOR never degrades the performance of the baseline policy with any admissible hyperparameter, and can learn to compete with the best policy within data coverage when the hyperparameter is well tuned and the baseline policy is supported by the data. Such a robust policy improvement property makes ARMOR especially suitable for building real-world learning systems, because in practice, ensuring no performance degradation is imperative before considering any benefit learning can bring.
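The relative-pessimism principle can be illustrated with a toy enumeration over a finite set of plausible models and a finite policy class: choose the policy whose worst-case advantage over the baseline, across models consistent with the data, is largest. ARMOR realizes this with adversarially trained dynamics models rather than enumeration; the helper names below are hypothetical.

```python
def armor_style_selection(policies, models, baseline, value_fn):
    """value_fn(model, policy) -> estimated return of `policy` under `model`."""
    def worst_case_advantage(pi):
        return min(value_fn(m, pi) - value_fn(m, baseline) for m in models)
    # If the baseline is included in `policies`, the selected policy's worst-case
    # advantage is at least 0, i.e. it never degrades the baseline under any plausible model.
    return max(policies, key=worst_case_advantage)
```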
Speaker Bio: Ching-An Cheng is a Senior Researcher at Microsoft Research. He is a practical theoretician whose work has developed foundations for designing principled algorithms that can tackle real-world challenges. His research studies structural properties in sequential decision making problems, especially in robotics, and aims to improve the learning efficiency of autonomous agents. His recent research focuses on reinforcement learning and imitation learning from offline data. Ching-An Cheng has received several best paper and finalist awards, including those from ICML, RSS, and AISTATS.
Speaker: Tengyang Xie (Microsoft Research)
Title: The Role of Coverage in Online Reinforcement Learning
Paper: https://arxiv.org/abs/2210.04157
Slides: here
More details:
The Role of Coverage in Online Reinforcement Learning
Authors: Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, Sham M. Kakade
Abstract: Coverage conditions -- which assert that the data logging distribution adequately covers the state space -- play a fundamental role in determining the sample complexity of offline reinforcement learning. While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish a new connection by showing -- somewhat surprisingly -- that the mere existence of a data distribution with good coverage can enable sample-efficient online RL. Concretely, we show that coverability -- that is, existence of a data distribution that satisfies a ubiquitous coverage condition called concentrability -- can be viewed as a structural property of the underlying MDP, and can be exploited by standard algorithms for sample-efficient exploration, even when the agent does not know said distribution. We complement this result by proving that several weaker notions of coverage, despite being sufficient for offline RL, are insufficient for online RL. We also show that existing complexity measures for online RL, including Bellman rank and Bellman-Eluder dimension, fail to optimally capture coverability, and propose a new complexity measure, the sequential extrapolation coefficient, to provide a unification.
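A small sketch of the quantity at play: coverability is the best concentrability constant achievable by any data distribution mu, i.e. the infimum over mu of max_{pi, s, a} d^pi(s, a) / mu(s, a). The toy code below evaluates one natural choice of mu (the normalized pointwise-maximum occupancy) over a finite set of policy occupancies, which gives an upper bound on the coverability coefficient; names and shapes are illustrative.

```python
import numpy as np

def coverability_upper_bound(occupancies):
    """occupancies: (num_policies, S, A) array of state-action occupancy measures."""
    mu = occupancies.max(axis=0)          # pointwise maximum over policies
    mu = mu / mu.sum()                    # normalize into a data distribution
    return (occupancies / np.maximum(mu, 1e-12)).max()
```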
Speaker Bio: Tengyang Xie is currently a postdoctoral researcher at Microsoft Research and will soon join UW-Madison as an Assistant Professor of Computer Science in Fall 2024. He has broad interests in designing provably efficient and practical algorithms for machine learning, especially for reinforcement learning and representation learning. His PhD research focuses on the foundations and methodologies of data-driven reinforcement learning. His work has received the Outstanding Paper Runner-up Award at ICML-2022.
Speaker: Fan Chen (MIT)
Title: Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning
Paper: https://arxiv.org/abs/2209.11745
Slides: here
More details:
Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning
Authors: Fan Chen, Song Mei, Yu Bai
Abstract: Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al. (2021) as a necessary and sufficient complexity measure for sample-efficient no-regret RL. This paper makes progress towards a unified theory for RL with the DEC framework. First, we propose two new DEC-type complexity measures: Explorative DEC (EDEC), and Reward-Free DEC (RFDEC). We show that they are necessary and sufficient for sample-efficient PAC learning and reward-free learning, thereby extending the original DEC which only captures no-regret learning. Next, we design new unified sample-efficient algorithms for all three learning goals. Our algorithms instantiate variants of the Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA improves upon the algorithms of Foster et al. (2021) which require either bounding a variant of the DEC which may be prohibitively large, or designing problem-specific estimation subroutines. As applications, we recover existing and obtain new sample-efficient learning results for a wide range of tractable RL problems using essentially a single algorithm. We also generalize the DEC to give sample-efficient algorithms for all-policy model estimation, with applications for learning equilibria in Markov Games. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling or Maximum Likelihood Estimation, showing that they enjoy similar regret bounds as E2D-TA under similar structural conditions as the DEC.
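For a finite model class and a finite decision space, a DEC-style quantity is a min-max over decision distributions and can be computed with a small linear program, as in the illustrative sketch below; the regret and divergence matrices are assumed inputs (measured against some fixed reference model), not a real RL instance.

```python
import numpy as np
from scipy.optimize import linprog

def finite_dec(regret, divergence, gamma):
    """min_{p in simplex} max_M sum_pi p(pi) * (regret[M, pi] - gamma * divergence[M, pi])."""
    payoff = regret - gamma * divergence
    n_models, n_decisions = payoff.shape
    # Variables: [p_1, ..., p_n, t]; minimize t subject to payoff @ p <= t and p in the simplex.
    c = np.zeros(n_decisions + 1); c[-1] = 1.0
    A_ub = np.hstack([payoff, -np.ones((n_models, 1))])
    b_ub = np.zeros(n_models)
    A_eq = np.hstack([np.ones((1, n_decisions)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_decisions + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun
```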
Speaker Bio: Fan Chen is a first-year Ph.D. student in the EECS Department at the Massachusetts Institute of Technology, advised by Sasha Rakhlin and Constantinos Daskalakis. Prior to MIT, he obtained his B.S. in Mathematics from Peking University. During his time at Peking University, he was advised by Song Mei and Zaiwen Wen. Fan's research interests include Reinforcement Learning, Optimization, and their interplay with the theory of computation.
Speaker: Angela Yuan (UCLA)
Title: A General Framework for Sample-Efficient Function Approximation in Reinforcement Learning
Paper: https://arxiv.org/abs/2209.15634
Slides: here
More details:
A General Framework for Sample-Efficient Function Approximation in Reinforcement Learning
Authors: Zixiang Chen, Chris Junchi Li, Angela Yuan, Quanquan Gu, Michael I. Jordan
Abstract: With the increasing need for handling large state and action spaces, general function approximation has become a key technique in reinforcement learning (RL). In this paper, we propose a general framework that unifies model-based and model-free RL, and an Admissible Bellman Characterization (ABC) class that subsumes nearly all Markov Decision Process (MDP) models in the literature for tractable RL. We propose a novel estimation function with decomposable structural properties for optimization-based exploration and the functional eluder dimension as a complexity measure of the ABC class. Under our framework, a new sample-efficient algorithm, OPtimization-based ExploRation with Approximation (OPERA), is proposed, achieving regret bounds that match or improve over the best-known results for a variety of MDP models. In particular, for MDPs with low Witness rank, under a slightly stronger assumption, OPERA improves the state-of-the-art sample complexity results by a factor of dH. Our framework provides a generic interface to design and analyze new RL models and algorithms.
Speaker Bio: Huizhuo (Angela) Yuan is currently a graduate student researcher in the Computer Science Department at UCLA. Before joining UCLA, she received her B.S. and Ph.D. degrees in Mathematics from Peking University. Her research centers on the theoretical and practical aspects of optimization, machine learning, and reinforcement learning, with applications in the fields of medical imaging and drug discovery.
Speaker: Dylan Foster (Microsoft Research)
Title: Tight Guarantees for Interactive Decision Making with the Decision-Estimation Coefficient
Paper: https://arxiv.org/abs/2301.08215
Slides: here
More details:
Tight Guarantees for Interactive Decision Making with the Decision-Estimation Coefficient
Authors: Dylan J. Foster, Noah Golowich, Yanjun Han
Abstract: A foundational problem in reinforcement learning and interactive decision making is to understand what modeling assumptions lead to sample-efficient learning guarantees, and what algorithm design principles achieve optimal sample complexity. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient (DEC), a measure of statistical complexity which leads to upper and lower bounds on the optimal sample complexity for a general class of problems encompassing bandits and reinforcement learning with function approximation. In this paper, we introduce a new variant of the DEC, the Constrained Decision-Estimation Coefficient, and use it to derive new lower bounds that improve upon prior work on three fronts:
- They hold in expectation, with no restrictions on the class of algorithms under consideration.
- They hold globally, and do not rely on the notion of localization used by Foster et al. (2021).
- Most interestingly, they allow the reference model with respect to which the DEC is defined to be improper, establishing that improper reference models play a fundamental role.
We provide upper bounds on regret that scale with the same quantity, thereby closing all but one of the gaps between upper and lower bounds in Foster et al. (2021). Our results apply to both the regret framework and PAC framework, and make use of several new analysis and algorithm design techniques that we anticipate will find broader use.
Speaker Bio: Dylan Foster is a senior researcher at Microsoft Research, New England. Previously, he was a postdoctoral fellow at MIT, and received his PhD in computer science from Cornell University, advised by Karthik Sridharan. His research focuses on problems at the intersection of machine learning and decision making, with a recent emphasis on reinforcement learning. He has received several awards, including the best paper award at COLT (2019) and best student paper award at COLT (2018, 2019).
Speaker: Audrey Huang (UIUC)
Title: Reinforcement Learning in Low-Rank MDPs with Density Features
Paper: https://arxiv.org/abs/2302.02252
Slides: here
More details:
Reinforcement Learning in Low-Rank MDPs with Density Features
Authors: Audrey Huang, Jinglin Chen, Nan Jiang
Abstract: MDPs with low-rank transitions -- that is, the transition matrix can be factored into the product of two matrices, left and right -- form a highly representative structure that enables tractable learning. The left matrix enables expressive function approximation for value-based learning and has been studied extensively. In this work, we instead investigate sample-efficient learning with density features, i.e., the right matrix, which induce powerful models for state-occupancy distributions. This setting not only sheds light on leveraging unsupervised learning in RL, but also enables plug-in solutions for convex RL. In the offline setting, we propose an algorithm for off-policy estimation of occupancies that can handle non-exploratory data. Using this as a subroutine, we further devise an online algorithm that constructs exploratory data distributions in a level-by-level manner. As a central technical challenge, the additive error of occupancy estimation is incompatible with the multiplicative definition of data coverage. In the absence of strong assumptions like reachability, this incompatibility easily leads to exponential error blow-up, which we overcome via novel technical tools. Our results also readily extend to the representation learning setting, when the density features are unknown and must be learned from an exponentially large candidate set.
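A quick sketch of why density (right-factor) features help: under a low-rank transition P(s'|s,a) = <phi(s,a), mu(s')>, the next-step occupancy of any policy is a k-dimensional combination of the density features. The tabular toy below propagates an occupancy one step; shapes and names are illustrative and unrelated to the paper's estimators.

```python
import numpy as np

def propagate_occupancy(d_h, policy, phi, mu):
    """d_h: (S,) occupancy; policy: (S, A); phi: (S, A, k) left features; mu: (k, S) density features."""
    # Aggregate the current occupancy and policy through the left features ...
    weights = np.einsum("s,sa,sak->k", d_h, policy, phi)
    # ... so the next-step occupancy lies in the span of the density features.
    return weights @ mu
```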
Speaker Bio: Audrey is a 3rd year Ph.D. candidate in computer science at UIUC, advised by Nan Jiang. Her research focuses on the theory and design of reinforcement learning algorithms. Prior to UIUC, she obtained an MS from CMU’s Machine Learning Department, and a BS in Computer Science from Caltech.
Speaker: Reda Ouhamma (EPFL)
Title: Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration and Planning
Paper: https://arxiv.org/abs/2210.02087
Slides: here
More details:
Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration and Planning
Authors: Reda Ouhamma, Debabrota Basu, Odalric-Ambrym Maillard
Abstract: We study the problem of episodic reinforcement learning in continuous state-action spaces with unknown rewards and transitions. Specifically, we consider the setting where the rewards and transitions are modeled using parametric bilinear exponential families. We propose an algorithm, BEF-RLSVI, that a) uses penalized maximum likelihood estimators to learn the unknown parameters, b) injects a calibrated Gaussian noise in the parameter of rewards to ensure exploration, and c) leverages linearity of the exponential family with respect to an underlying RKHS to perform tractable planning. We further provide a frequentist regret analysis of BEF-RLSVI that yields an upper bound of O(sqrt(d^3H^3K)), where d is the dimension of the parameters, H is the episode length, and K is the number of episodes. Our analysis improves the existing bounds for the bilinear exponential family of MDPs by sqrt(H) and removes the handcrafted clipping deployed in existing RLSVI-type algorithms. Our regret bound is order-optimal with respect to H and K.
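The exploration step can be sketched in a line or two: perturb the estimated reward parameter with Gaussian noise shaped by the inverse design matrix before planning, so that planning against the perturbed parameter is optimistic sufficiently often. The calibration below is a placeholder, not the paper's exact noise schedule.

```python
import numpy as np

def perturb_reward_parameter(theta_hat, design_matrix, noise_scale, rng=None):
    """RLSVI-style perturbation: sample theta ~ N(theta_hat, noise_scale^2 * design_matrix^{-1})."""
    rng = rng if rng is not None else np.random.default_rng(0)
    cov = noise_scale**2 * np.linalg.inv(design_matrix)   # placeholder calibration
    return rng.multivariate_normal(theta_hat, cov)
```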
Speaker Bio: Reda Ouhamma joined the Systems Control and Multiagent Optimization Research team at EPFL as a postdoctoral researcher in June. His interests span reinforcement learning theory, multi-armed bandits, LQR, and multi-agent reinforcement learning. Prior to joining EPFL, he pursued a PhD at Inria Lille's Scool team (formerly SequeL), under the supervision of Odalric Maillard and Vianney Perchet. Before his PhD, Reda Ouhamma graduated from École Polytechnique and ENS Paris-Saclay.