# Past Seminars - Spring 2022

## Tuesday 14th June 2022, 5PM UTC

Speaker: Daniel Russo (Columbia)

Title: Adaptivity and Confounding in Multi-Armed Bandit Experiments

Paper: https://arxiv.org/abs/2202.09036

Slides: here

More details:

Adaptivity and Confounding in Multi-Armed Bandit Experiments

Authors: Chao Qin, Daniel Russo

Abstract: We explore a new model of bandit experiments where a potentially nonstationary sequence of contexts influences arms' performance. Context-unaware algorithms risk confounding while those that perform correct inference face information delays. Our main insight is that an algorithm we call deconfounted Thompson sampling strikes a delicate balance between adaptivity and robustness. Its adaptivity leads to optimal efficiency properties in easy stationary instances, but it displays surprising resilience in hard nonstationary ones which cause other adaptive algorithms to fail.

Speaker Bio: Daniel joined the Decision, Risk, and Operations division of the Columbia Business School in Summer 2017. Daniels research lies at the intersection of statistical machine learning and online decision making, mostly falling under the broad umbrella of reinforcement learning. Outside academia, Daniel works with Spotify to apply reinforcement learning style models to audio recommendations.

Prior to joining Columbia, Daniel spent one great year as an assistant professor in the MEDS department at Northwestern's Kellogg School of Management and one year at Microsoft Research in New England as Postdoctoral Researcher. Daniel recieved his PhD from Stanford University in 2015, where he was advised by Benjamin Van Roy. In 2011 he recieved a BS in Mathematics and Economics from the University of Michigan.

## Tuesday 7th June 2022, 5PM UTC

Speaker: Matus Telgarsky (UIUC)

Title: Actor-critic is implicitly biased towards high entropy optimal policies

Paper: https://arxiv.org/abs/2110.11280

Slides: here

More details:

Actor-critic is implicitly biased towards high entropy optimal policies

Authors: Yuzheng Hu, Ziwei Ji, Matus Telgarsky

Abstract: We show that the simplest actor-critic method -- a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration -- does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like ϵ-greedy, but is moreover trained on a single trajectory with no resets. The key consequence of the high entropy bias is that uniform mixing assumptions on the MDP, which exist in some form in all prior work, can be dropped: the implicit regularization of the high entropy bias is enough to ensure that all chains mix and an optimal policy is reached with high probability. As auxiliary contributions, this work decouples concerns between the actor and critic by writing the actor update as an explicit mirror descent, provides tools to uniformly bound mixing times within KL balls of policy space, and provides a projection-free TD analysis with its own implicit bias which can be run from an unmixed starting distribution.

Speaker Bio: Matus Telgarsky is an assistant professor at the University of Illinois, Urbana-Champaign, specializing in deep learning theory. He was fortunate to receive a PhD at UCSD under Sanjoy Dasgupta. Other highlights include: co-founding, in 2017, the Midwest ML Symposium (MMLS) with Po-Ling Loh; receiving a 2018 NSF CAREER award; organizing a Simons Insititute summer 2019 program on deep learning with Samy Bengio, Aleskander Madry, and Elchanan Mossel.

## Tuesday 31st May 2022, 5PM UTC

Speaker: Aadirupa Saha (MSR)

Title: Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability

Paper: https://arxiv.org/abs/2111.12306

Slides: here

More details:

Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability

Authors: Aadirupa Saha, Akshay Krishnamurthy

Abstract: We study the K-armed contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes \emph{preference-based feedback} suggesting that one decision was better than the other. We focus on the regret minimization problem under realizability, where the feedback is generated by a pairwise preference matrix that is well-specified by a given function class . We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works. The algorithm is also computationally efficient, running in polynomial time assuming access to an online oracle for square loss regression over . This resolves an open problem of Dudík et al. [2015] on oracle efficient, regret-optimal algorithms for contextual dueling bandits.

Speaker Bio: Aadirupa is a postdoctoral researcher at Microsoft Research New York City. She obtained her PhD from the department of Computer Science, Indian Institute of Science, Bangalore, advised by Aditya Gopalan and Chiranjib Bhattacharyya. Aadirupa was an intern at Microsoft Research, Bangalore, Inria, Paris, and Google AI, Mountain View. Aadirupas research interests include Bandits, Reinforcement Learning, Optimization, Learning theory, Algorithms.

## Tuesday 24th May 2022, 5PM UTC

Speaker: Yu Bai (Salesforce AI)

Title: When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently?

Paper: https://arxiv.org/abs/2110.04184

Slides: here

More details:

When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently?

Authors: Ziang Song, Song Mei, Yu Bai

Abstract: Multi-agent reinforcement learning has made substantial empirical progresses in solving games with a large number of players. However, theoretically, the best known sample complexity for finding a Nash equilibrium in general-sum games scales exponentially in the number of players due to the size of the joint action space, and there is a matching exponential lower bound.

We investigate what learning goals admit better sample complexities in the setting of $m$-player general-sum Markov games with $H$ steps, $S$ states, and $A_i$ actions per player. First, we design algorithms for learning an $\epsilon$-Coarse Correlated Equilibrium (CCE) in $\widetilde{\mathcal{O}}(H^5S\max_{i\le m} A_i / \epsilon^2)$ episodes, and an $\epsilon$-Correlated Equilibrium (CE) in $\widetilde{\mathcal{O}}(H^6S\max_{i\le m} A_i^2 / \epsilon^2)$ episodes. This is the first line of results for learning CCE and CE with sample complexities polynomial in $\max_{i\le m} A_i$. Our algorithm for learning CE integrates an adversarial bandit subroutine which minimizes a weighted swap regret, along with several novel designs in the outer loop. Second, we consider the important special case of Markov Potential Games, and design an algorithm that learns an $\epsilon$-approximate Nash equilibrium within $\widetilde{\mathcal{O}}(S\sum_{i\le m} A_i / \epsilon^3)$ episodes (when only highlighting the dependence on $S$, $A_i$, and $\epsilon$), which only depends linearly in $\sum_{i\le m} A_i$ and significantly improves over existing efficient algorithms in the $\epsilon$ dependence. Overall, our results shed light on what equilibria or structural assumptions on the game may enable sample-efficient learning with many players.

Speaker Bio: Yu Bai is currently a Senior Research Scientist at Salesforce AI Research. Yu's research interest lies broadly in machine learning, with recent focus on the theoretical foundations of multi-agent reinforcement learning and games, deep learning, and uncertainty quantification. Prior to joining Salesforce, Yu completed his PhD in Statistics at Stanford University.

## Tuesday 10th May 2022, 5PM UTC

Speaker: Xuezhou Zhang (Princeton)

Title: Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach

Paper: https://arxiv.org/abs/2202.00063

Slides: here

More details:

Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach

Authors: Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, Wen Sun

Abstract: We present BRIEE (Block-structured Representation learning with Interleaved Explore Exploit), an algorithm for efficient reinforcement learning in Markov Decision Processes with block-structured dynamics (i.e., Block MDPs), where rich observations are generated from a set of unknown latent states. BRIEE interleaves latent states discovery, exploration, and exploitation together, and can provably learn a near-optimal policy with sample complexity scaling polynomially in the number of latent states, actions, and the time horizon, with no dependence on the size of the potentially infinite observation space. Empirically, we show that BRIEE is more sample efficient than the state-of-art Block MDP algorithm HOMER and other empirical RL baselines on challenging rich-observation combination lock problems that require deep exploration.

Speaker Bio: Xuezhou Zhang is currently an Associate Research Scholar in the Department of Electrical and Computer Engineering at Princeton University, working with Mengdi Wang. He received his Ph.D. in Computer Sciences from the University of Wisconsin-Madison in 2021, advised by Jerry Zhu. Prior to that, he obtained his B.S. in Applied Mathematics from UCLA in 2016. He is a visitor at Simons Institute for the Theory of Computing in Spring 2022. His research primarily focuses on reinforcement learning theory and applications.

## Tuesday 3th May 2022, 5PM UTC

Speaker: Kwang-Sung Jun (U Arizona)

Title: Maillard Sampling: Boltzmann Exploration Done Optimally

Paper: https://arxiv.org/abs/2111.03290

Slides: here

More details:

Maillard Sampling: Boltzmann Exploration Done Optimally

Authors: Jie Bian, Kwang-Sung Jun

Abstract: The PhD thesis of Maillard (2013) presents a rather obscure algorithm for the K-armed bandit problem. This less-known algorithm, which we call Maillard sampling (MS), computes the probability of choosing each arm in a \textit{closed form}, which is not true for Thompson sampling, a widely-adopted bandit algorithm in the industry. This means that the bandit-logged data from running MS can be readily used for counterfactual evaluation, unlike Thompson sampling. Motivated by such merit, we revisit MS and perform an improved analysis to show that it achieves both the asymptotical optimality and \sqrt{KTlogT} minimax regret bound where T is the time horizon, which matches the known bounds for asymptotically optimal UCB. We then propose a variant of MS called MS+ that improves its minimax bound to \sqrt{KTlogK}. MS+ can also be tuned to be aggressive (i.e., less exploration) without losing the asymptotic optimality, a unique feature unavailable from existing bandit algorithms. Our numerical evaluation shows the effectiveness of MS+.

Speaker Bio: Kwang-Sung Jun is an assistant professor at the CS department, University of Arizona since 2019. His research interest is interactive machine learning with a focus on bandit problems, online learning, and confidence bounds. Before joining UA, he was a postdoc at Boston University with Dr. Francesco Orabona. Before then, he spent 9 years at University of Wisconsin-Madison for a PhD degree with Dr. Xiaojin (Jerry) Zhu and a postdoc with Drs. Robert Nowak, Rebecca Willett, and Stephen Wright.

## Tuesday 19th APRIL 2022, 5PM UTC

Speaker: Masatoshi Uehara (Cornell)

Title: Representation Learning for Online and Offline RL in Low-rank MDPs

Paper: https://arxiv.org/abs/2110.04652

Slides: here

More details:

Representation Learning for Online and Offline RL in Low-rank MDPs

Authors: Masatoshi Uehara, Xuezhou Zhang, Wen Sun

Abstract: This work studies the question of Representation Learning in RL: how can we learn a compact low-dimensional representation such that on top of the representation we can perform RL procedures such as exploration and exploitation, in a sample efficient manner. We focus on the low-rank Markov Decision Processes (MDPs) where the transition dynamics correspond to a low-rank transition matrix. Unlike prior works that assume the representation is known (e.g., linear MDPs), here we need to learn the representation for the low-rank MDP. We study both the online RL and offline RL settings. For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et al), the state-of-art algorithm for learning representations in low-rank MDPs, we propose an algorithm REP-UCB Upper Confidence Bound driven Representation learning for RL), which significantly improves the sample complexity from O˜(A^9d^7/(ϵ^10(1−γ)^22)) for FLAMBE to O˜(A^2d^4/(ϵ^2(1−γ)^5)) with d being the rank of the transition matrix (or dimension of the ground truth representation), A being the number of actions, and γ being the discounted factor. Notably, REP-UCB is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation, while FLAMBE is an explore-then-commit style approach and has to perform reward-free exploration step-by-step forward in time. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition: our algorithm is able to compete against any policy as long as it is covered by the offline distribution.

Speaker Bio: Masatoshi is a PhD student at Cornell in computer science department. Previously, he was a PhD student at Harvard in statistics department. Masatoshi is interested in statistical learning theory and method for sequential decision making. In particular, his interests are reinforcement learning, causal inference, online learning and application to e-commerce. Masatoshi is advised by Nathan Kallus (Cornell).

## Tuesday 12th APRIL 2022, 5PM UTC

Speaker: Yunzong Xu (MIT)

Title: Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation

Paper: https://arxiv.org/abs/2111.10919

Slides: here

More details:

Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation

Authors: Dylan J. Foster, Akshay Krishnamurthy, David Simchi-Levi, Yunzong Xu

Abstract: We consider the offline reinforcement learning problem, where the aim is to learn a decision making policy from logged data. Offline RL -- particularly when coupled with (value) function approximation to allow for generalization in large or continuous state spaces -- is becoming increasingly relevant in practice, because it avoids costly and time-consuming online data collection and is well suited to safety-critical domains. Existing sample complexity guarantees for offline value function approximation methods typically require both (1) distributional assumptions (i.e., good coverage) and (2) representational assumptions (i.e., ability to represent some or all Q-value functions) stronger than what is required for supervised learning. However, the necessity of these conditions and the fundamental limits of offline RL are not well understood in spite of decades of research. This led Chen and Jiang (2019) to conjecture that concentrability (the most standard notion of coverage) and realizability (the weakest representation condition) alone are not sufficient for sample-efficient offline RL. We resolve this conjecture in the positive by proving that in general, even if both concentrability and realizability are satisfied, any algorithm requires sample complexity polynomial in the size of the state space to learn a non-trivial policy.

Our results show that sample-efficient offline reinforcement learning requires either restrictive coverage conditions or representation conditions that go beyond supervised learning, and highlight a phenomenon called over-coverage which serves as a fundamental barrier for offline value function approximation methods. A consequence of our results for reinforcement learning with linear function approximation is that the separation between online and offline RL can be arbitrarily large, even in constant dimension.

Speaker Bio: Yunzong is a fourth-year Ph.D. student in the Institute for Data, Systems, and Society at MIT. He is affiliated with the Statistics and Data Science Center and the Laboratory for Information and Decision Systems, and is advised by David Simchi-Levi. In Summer 2021, he was a research intern at Microsoft Research NYC. Prior to joining MIT, he received his dual bachelor's degrees in information systems and mathematics from Tsinghua University in 2018. Yunzong is broadly interested in statistical machine learning and operations research. His current research interests include data-driven decision making, online and reinforcement learning, econometrics and causal inference, with applications to revenue management and healthcare.

## Tuesday 5th April 2022, 5PM UTC

Speaker: Dylan J. Foster (Microsoft)

Title: The Statistical Complexity of Interactive Decision Making

Paper: https://arxiv.org/abs/2112.13487

Slides: here

More details:

Sample complexity for average-reward MDP

Authors: Dylan J. Foster, Sham M. Kakade, Jian Qian, Alexander Rakhlin

Abstract: A fundamental challenge in interactive learning and decision making, ranging from bandit problems to reinforcement learning, is to provide sample-efficient, adaptive learning algorithms that achieve near-optimal regret. This question is analogous to the classical problem of optimal (supervised) statistical learning, where there are well-known complexity measures (e.g., VC dimension and Rademacher complexity) that govern the statistical complexity of learning. However, characterizing the statistical complexity of interactive learning is substantially more challenging due to the adaptive nature of the problem. The main result of this work provides a complexity measure, the Decision-Estimation Coefficient, that is proven to be both necessary and sufficient for sample-efficient interactive learning. In particular, we provide:

1. a lower bound on the optimal regret for any interactive decision making problem, establishing the Decision-Estimation Coefficient as a fundamental limit.

2. a unified algorithm design principle, Estimation-to-Decisions (E2D), which transforms any algorithm for supervised estimation into an online algorithm for decision making. E2D attains a regret bound matching our lower bound, thereby achieving optimal sample-efficient learning as characterized by the Decision-Estimation Coefficient.

Taken together, these results constitute a theory of learnability for interactive decision making. When applied to reinforcement learning settings, the Decision-Estimation Coefficient recovers essentially all existing hardness results and lower bounds. More broadly, the approach can be viewed as a decision-theoretic analogue of the classical Le Cam theory of statistical estimation; it also unifies a number of existing approaches -- both Bayesian and frequentist.

Speaker Bio: Dylan is a senior researcher at Microsoft Research, New England, where he is a member of the Reinforcement Learning Group. Previously, Dylan was a postdoctoral fellow at the MIT Institute for Foundations of Data Science, and prior to this he received his PhD from the Department of Computer Science at Cornell University (2019), advised by Karthik Sridharan. Dylan received my BS and MS in electrical engineering from the University of Southern California in 2014. He works on theory for machine learning. Dylans research lies at the intersection of machine learning and decision making, including data-driven reinforcement learning and control, contextual bandits, and statistical learning in causal/counterfactual settings. He is interested in uncovering new algorithmic principles and fundamental limits for data-driven decision making, with themes including: Sample efficiency, Algorithm design, and Adaptivity.

## Tuesday 15th March 2022, 5PM UTC

Speaker: Yujia Jin (Stanford)

Title: Towards Tight Bounds on the Sample Complexity of Average-reward MDPs

Paper: https://arxiv.org/abs/2106.07046

Slides: here

More details:

Sample complexity for average-reward MDP

Authors: Yujia Jin, Aaron Sidford

Abstract: In this talk I will introduce recent progress on the sample complexity for solving average-reward Markov decision processes (AMDPs). Given access to a generative model for an AMDP with SA total number of state-action pairs, where every stationary policy has mixing time at most t_mix, we present two new methods for finding approximately-optimal stationary policies. Together, these methods yield a sample complexity upper bound on the order of SA t_mix/eps^2*min(t_mix, 1/eps)). We also provide a sample complexity lower bound of SA t_mix/eps^2 oblivious samples. This work makes progress toward designing new algorithms with better sample complexity for solving AMDPs, providing a generic convex optimization framework for solving MDPs, and demonstrating the close relationship between AMDPs and discounted MDPs. It also points to a key open problem of closing the gap between our upper and lower bounds.

The talk is based on joint work with Aaron Sidford.

Speaker Bio: Yujia is a fourth-year PhD student in the Department of Management Science and Engineering at Stanford in the Operations Research group. Yujia is advised by Aaron Sidford and is broadly interested in optimization problems, sometimes in the intersection with machine learning theory and graph applications. Yujia enjoys understanding the theoretical ground of many algorithms that are of practical importance.

Prior to coming to Stanford, Yujia received a Bachelor's degree in Applied Math at Fudan University, where she worked with Prof. Zhongzhi Zhang. From 2016 to 2018, Yujia also worked in Research Institute for Interdisciplinary Sciences (RIIS) at SHUFE, where she was advised by Prof. Dongdong Ge.

## Tuesday 8th March 2022, 6pm UTC

Speaker: Tong Zhang (HKUST)

Title: Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning

Paper: https://arxiv.org/abs/2110.00871

Slides: here

More details:

Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning

Authors: Tong Zhang

Abstract: Thompson Sampling has been widely used for contextual bandit problems due to the flexibility of its modeling power. However, a general theory for this class of methods in the frequentist setting is still lacking. In this paper, we present a theoretical analysis of Thompson Sampling, with a focus on frequentist regret bounds. In this setting, we show that the standard Thompson Sampling is not aggressive enough in exploring new actions, leading to suboptimality in some pessimistic situations. A simple modification called Feel-Good Thompson Sampling, which favors high reward models more aggressively than the standard Thompson Sampling, is proposed to remedy this problem. We show that the theoretical framework can be used to derive Bayesian regret bounds for standard Thompson Sampling, and frequentist regret bounds for Feel-Good Thompson Sampling. It is shown that in both cases, we can reduce the bandit regret problem to online least squares regression estimation. For the frequentist analysis, the online least squares regression bound can be directly obtained using online aggregation techniques which have been well studied. The resulting bandit regret bound matches the minimax lower bound in the finite action case. Moreover, the analysis can be generalized to handle a class of linearly embeddable contextual bandit problems (which generalizes the popular linear contextual bandit model). The obtained result again matches the minimax lower bound. Finally we illustrate that the analysis can be extended to handle some MDP problems.

Speaker Bio: Tong Zhang is a professor of Computer Science and Mathematics at The Hong Kong University of Science and Technology. Previously, he was a professor at Rutgers university, and worked at IBM, Yahoo, Baidu, and Tencent. Tong Zhang's research interests include machine learning algorithms and theory, statistical methods for big data and their applications. He is a fellow of ASA, IEEE, and IMS, and he has been in the editorial boards of leading machine learning journals and program committees of top machine learning conferences. Tong Zhang received a B.A. in mathematics and computer science from Cornell University and a Ph.D. in Computer Science from Stanford University.

## Tuesday 1st March 2022, 6pm UTC

Speaker: Yingjie Fei (Bloomberg)

Title: Risk-Sensitive Reinforcement Learning: Symmetry, Asymmetry, and Risk-Sample Tradeoff

Paper: https://arxiv.org/abs/2111.03947

Slides: here

More details:

Risk-Sensitive Reinforcement Learning: Symmetry, Asymmetry, and Risk-Sample Tradeoff

Authors: Yingjie Fei, Zhuoran Yang, Yudong Chen, Zhaoran Wang

Abstract: We study risk-sensitive reinforcement learning based on the entropic risk measure. The problem is challenging as it admits nonlinear value functions and Bellman equations. We investigate the problem via a transformation of the Bellman equations, which we call exponential Bellman equation. We show that the induced error dynamics of Bellman backup is asymmetric in risk sensitivity and motivate the design of a novel exploration mechanism. With these findings, we propose two model-free algorithms based on value iteration and Q-learning, which adapt to both risk-seeking and risk-averse modes of exploration. Perhaps surprisingly, we show that the proposed algorithms attain regret upper bounds that are symmetric in risk sensitivity and exhibit a tradeoff between risk sensitivity and sample complexity. We also provide a lower bound to certify the near optimality of the upper bounds.

Speaker: Yingjie Fei (Bloomberg)

Speaker Bio: Yingjie is a research engineer at Bloomberg. He received his PhD from Cornell University. He is broadly interested in problems related to efficient and robust decision making.

## Tuesday 22nd February 2022, 6pm UTC

Speaker: Sean Meyn (U. Florida)

Title: The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning

Paper: https://arxiv.org/abs/2110.14427

Slides: here

More details:

The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning

Authors: Vivek Borkar, Shuhang Chen, Adithya Devraj, Ioannis Kontoyiannis, Sean Meyn

Abstract: The paper concerns convergence and asymptotic statistics for stochastic approximation driven by Markovian noise: θ_{n+1}=θ_n+α_{n+1}f(θ_n,Φ_{n+1}),n≥0, in which each θ_n∈ℜ^d, {Φ_n} is a Markov chain on a general state space X with stationary distribution π, and f:ℜ^d×X→ℜ^d. In addition to standard Lipschitz bounds on f, and conditions on the vanishing step-size sequence {α_n}, it is assumed that the associated ODE is globally asymptotically stable with stationary point denoted θ*, where f¯(θ)=E[f(θ,Φ)] with Φ∼π. Moreover, the ODE@∞ defined with respect to the vector field, f¯∞(θ):=lim_{r→∞}r^{−1}f¯(rθ),θ∈ℜ^d, is asymptotically stable. The main contributions are summarized as follows:

(i) The sequence θ is convergent if Φ is geometrically ergodic, and subject to compatible bounds on f.

The remaining results are established under a stronger assumption on the Markov chain: A slightly weaker version of the Donsker-Varadhan Lyapunov drift condition known as (DV3).

(ii) A Lyapunov function is constructed for the joint process {θ_n,Φ_n} that implies convergence of {θ_n} in L4.

(iii) A functional CLT is established, as well as the usual one-dimensional CLT for the normalized error zn:=(θ_n−θ*)/\sqrt{α_n}. Moment bounds combined with the CLT imply convergence of the normalized covariance, lim_{n→∞}E[z_nz^T_n]=Σ_θ, where Σ_θ is the asymptotic covariance appearing in the CLT.

(iv) An example is provided where the Markov chain Φ is geometrically ergodic but it does not satisfy (DV3). While the algorithm is convergent, the second moment is unbounded.

Speaker Bio: Dr. Sean Meyn received the B.A. degree in mathematics from the University of California, Los Angeles (UCLA), in 1982 and the Ph.D. degree in electrical engineering from McGill University, Canada, in 1987 (with Prof. P. Caines, McGill University). He is now Professor and Robert C. Pittman Eminent Scholar Chair in the Department of Electrical and Computer Engineering at the University of Florida, the director of the Laboratory for Cognition & Control, and director of the Florida Institute for Sustainable Energy. His academic research interests include theory and applications of decision and control, stochastic processes, and optimization. He has received many awards for his research on these topics, and is a fellow of the IEEE.

## Tuesday 15th February 2022, 6pm UTC

Speaker: Andrew Wagenmaker (U. Washington)

Title: First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach

Paper: https://arxiv.org/abs/2112.03432

Slides: here

More details:

First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach

Authors: Andrew Wagenmaker, Yifang Chen, Max Simchowitz, Simon S. Du, Kevin Jamieson

Abstract: Obtaining first-order regret bounds -- regret bounds scaling not as the worst-case but with some measure of the performance of the optimal policy on a given instance -- is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap, and show that it is possible to obtain regret scaling as O(\sqrt{V*_1K}) in reinforcement learning with large state spaces, namely the linear MDP setting. Here V*_1 is the value of the optimal policy and K is the number of episodes. We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.

Speaker Bio: Andrew is a fourth-year Ph.D. student in Computer Science at the University of Washington working with Kevin Jamieson. His research interests are in active learning, reinforcement learning, and control theory. Andrew is supported by an NSF Graduate Research Fellowship.

Previously, Andrew completed a master's and bachelor's degree at the University of Michigan, both in Electrical Engineering. While at the University of Michigan, he worked with Raj Rao Nadakuditi and Necmiye Ozay.