Past Seminars - SUMMER 2023

Tuesday 18th July 2023, 5PM UTC

Speaker:  Ayush Sekhari (MIT)

Title: On the Complexity of Adversarial Decision Making

Paper: https://arxiv.org/abs/2206.13063

Slides: here

More details:

On the Complexity of Adversarial Decision Making

Authors: Dylan J. Foster, Alexander Rakhlin, Ayush Sekhari, Karthik Sridharan

Abstract: A central problem in online learning and decision making -- from bandits to reinforcement learning -- is to understand what modeling assumptions lead to sample-efficient learning guarantees. We consider a general adversarial decision making framework that encompasses (structured) bandit problems with adversarial rewards and reinforcement learning problems with adversarial dynamics. Our main result is to show -- via new upper and lower bounds -- that the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. in the stochastic counterpart to our setting, is necessary and sufficient to obtain low regret for adversarial decision making. However, compared to the stochastic setting, one must apply the Decision-Estimation Coefficient to the convex hull of the class of models (or, hypotheses) under consideration. This establishes that the price of accommodating adversarial rewards or dynamics is governed by the behavior of the model class under convexification, and recovers a number of existing results -- both positive and negative. En route to obtaining these guarantees, we provide new structural results that connect the Decision-Estimation Coefficient to variants of other well-known complexity measures, including the Information Ratio of Russo and Van Roy and the Exploration-by-Optimization objective of Lattimore and György.
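
For readers new to the Decision-Estimation Coefficient, here is a rough sketch of its definition (our paraphrase; notation may differ slightly from the paper). For a model class \(\mathcal{M}\), a reference model \(\bar{M}\), and a scale parameter \(\gamma > 0\),

\[
\mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M}) \;=\; \inf_{p \in \Delta(\Pi)} \; \sup_{M \in \mathcal{M}} \; \mathbb{E}_{\pi \sim p}\Big[ f^{M}(\pi_{M}) - f^{M}(\pi) - \gamma\, D_{\mathrm{H}}^{2}\big(M(\pi), \bar{M}(\pi)\big) \Big],
\]

where \(f^{M}(\pi)\) is the value of decision \(\pi\) under model \(M\), \(\pi_{M}\) is the optimal decision for \(M\), and \(D_{\mathrm{H}}^{2}\) is the squared Hellinger distance. The result above says that in the adversarial setting the relevant quantity is this coefficient evaluated on the convex hull co(M) rather than on the model class itself.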

Speaker Bio: Ayush is a postdoctoral researcher working with Prof. Sasha Rakhlin at MIT. He completed his Ph.D. in the Computer Science department at Cornell University, advised by Professor Karthik Sridharan and Professor Robert D. Kleinberg. His research interests span optimization, online learning, reinforcement learning and control, and the interplay between them. Before coming to Cornell, he spent a year at Google as part of the Brain Residency program. Before Google, he completed his undergraduate studies in computer science at IIT Kanpur in India, where he was awarded the President's Gold Medal. Ayush also won a student best paper award at COLT 2019.

Tuesday 11th July 2023, 5PM UTC

Speaker:  Yann Ollivier (Meta AI Research)

Title: Does Zero-Shot Reinforcement Learning Exist?

Paper: https://arxiv.org/abs/2209.14935 

Slides: here

More details:

Does Zero-Shot Reinforcement Learning Exist?

Authors: Ahmed Touati, Jérémy Rapin, Yann Ollivier

Abstract: A zero-shot RL agent is an agent that can solve any RL task in a given environment, instantly with no additional planning or learning, after an initial reward-free learning phase. This marks a shift from the reward-centric RL paradigm towards "controllable" agents that can follow arbitrary instructions in an environment. Current RL agents can solve families of related tasks at best, or require planning anew for each task. Strategies for approximate zero-shot RL have been suggested using successor features (SFs) [BBQ+ 18] or forward-backward (FB) representations [TO21], but testing has been limited.

After clarifying the relationships between these schemes, we introduce improved losses and new SF models, and test the viability of zero-shot RL schemes systematically on tasks from the Unsupervised RL benchmark [LYL+21]. To disentangle universal representation learning from exploration, we work in an offline setting and repeat the tests on several existing replay buffers.

SFs appear to suffer from the choice of the elementary state features. SFs with Laplacian eigenfunctions do well, while SFs based on auto-encoders, inverse curiosity, transition models, low-rank transition matrices, contrastive learning, or diversity (APS) perform inconsistently. In contrast, FB representations jointly learn the elementary and successor features from a single, principled criterion. They perform best and consistently across the board, reaching 85% of supervised RL performance with a good replay buffer, in a zero-shot manner.
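
To make the zero-shot mechanism concrete, here is a minimal Python sketch of how a policy is read off from an FB representation once a reward is revealed. The maps F and B, the finite action set, and the function names are our illustrative assumptions, not the paper's code:

```python
import numpy as np

# Minimal sketch of zero-shot policy extraction with a forward-backward (FB)
# representation, assuming pretrained maps F(s, a, z) -> R^d and B(s) -> R^d.

def reward_embedding(rewards, states, B):
    # z_r ~ E_{s ~ rho}[ r(s) B(s) ], estimated from states sampled from the
    # replay-buffer distribution rho.
    return np.mean([r * B(s) for r, s in zip(rewards, states)], axis=0)

def zero_shot_action(F, z_r, state, actions):
    # Act greedily with respect to Q(s, a) = F(s, a, z_r)^T z_r; no planning or
    # further learning is needed once z_r has been computed.
    q_values = [F(state, a, z_r) @ z_r for a in actions]
    return actions[int(np.argmax(q_values))]
```

The reward enters only through the embedding z_r, which is why a single reward-free pretraining of F and B can serve all downstream tasks.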

Speaker Bio: Yann Ollivier is a research scientist at Meta FAIR Labs in Paris. After studying at Ecole normale superieure, he worked for several years on probability theory, differential geometry, and group theory at the Centre national de la recherche scientifique. Around 2010 he moved to machine learning, with a focus on recurrent models, stochastic optimization, and reinforcement learning. He joined Meta's AI lab in 2017 to work mostly on reinforcement learning.

Tuesday 4th July 2023, 5PM UTC

Speaker: Qinghua Liu (Princeton University)

Title: Optimistic MLE -- A Generic Model-based Algorithm for Partially Observable Sequential Decision Making

Paper: https://arxiv.org/abs/2209.14997

Slides: here

More details:

Optimistic MLE -- A Generic Model-based Algorithm for Partially Observable Sequential Decision Making

Authors: Qinghua Liu, Praneeth Netrapalli, Csaba Szepesvári, Chi Jin

Abstract: This paper introduces a simple and efficient learning algorithm for general sequential decision making. The algorithm combines Optimism for exploration with Maximum Likelihood Estimation for model estimation, and is thus named OMLE. We prove that OMLE learns near-optimal policies for an enormously rich class of sequential decision making problems in a polynomial number of samples. This rich class includes not only a majority of known tractable model-based Reinforcement Learning (RL) problems (such as tabular MDPs, factored MDPs, low witness rank problems, tabular weakly-revealing/observable POMDPs and multi-step decodable POMDPs), but also many new challenging RL problems, especially in the partially observable setting, that were not previously known to be tractable.

Notably, the new problems addressed by this paper include (1) observable POMDPs with continuous observation and function approximation, where we achieve the first sample complexity that is completely independent of the size of the observation space; (2) well-conditioned low-rank sequential decision making problems (also known as Predictive State Representations (PSRs)), which include and generalize all known tractable POMDP examples under a more intrinsic representation; (3) general sequential decision making problems under the SAIL condition, which unifies our existing understanding of model-based RL in both fully observable and partially observable settings. The SAIL condition, identified in this paper, can be viewed as a natural generalization of the Bellman/witness rank to address partial observability. This paper also presents a reward-free variant of the OMLE algorithm, which learns approximate dynamic models that enable the computation of near-optimal policies for all reward functions simultaneously.
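
A schematic view of the OMLE loop, paraphrased from the abstract; the finite model class, the confidence radius beta, and the method names are illustrative placeholders, not the paper's exact algorithm:

```python
# Schematic sketch of OMLE: optimism for exploration plus maximum likelihood
# estimation for model estimation.

def omle(model_class, rollout, num_iterations, beta):
    data, policy = [], None
    for _ in range(num_iterations):
        # Confidence set: models whose log-likelihood on the data collected so
        # far is within beta of the maximum-likelihood model.
        max_ll = max(m.log_likelihood(data) for m in model_class)
        confidence_set = [m for m in model_class
                          if m.log_likelihood(data) >= max_ll - beta]
        # Optimism: among plausible models, pick the one promising the highest
        # optimal value, and play its optimal policy.
        model = max(confidence_set, key=lambda m: m.optimal_value())
        policy = model.optimal_policy()
        # Collect fresh trajectories with the optimistic policy.
        data.extend(rollout(policy))
    return policy
```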

Speaker Bio: Qinghua Liu is a Ph.D. candidate in the Department of Electrical and Computer Engineering at Princeton University. His PhD advisor is Prof. Chi Jin. His research interests focus on reinforcement learning theory, in particular, RL problems related to function approximation, partial observability and game theory. During summer 2022, he was a research scientist intern at DeepMind, London, working with Csaba Szepesvári and Gellért Weisz. Previously, he received a B.E. degree in Electrical Engineering and a B.S. degree in Mathematics from Tsinghua University in 2018.

Tuesday 27th June 2023, 5PM UTC

Speaker: Kefan Dong (Stanford University)

Title: Asymptotic Instance-Optimal Algorithms for Interactive Decision Making

Paper: https://arxiv.org/abs/2206.02326

Slides: here

More details:

Asymptotic Instance-Optimal Algorithms for Interactive Decision Making

Authors: Kefan Dong, Tengyu Ma

Abstract: Past research on interactive decision making problems (bandits, reinforcement learning, etc.) mostly focuses on the minimax regret that measures the algorithm's performance on the hardest instance. However, an ideal algorithm should adapt to the complexity of a particular problem instance and incur smaller regret on easy instances than on worst-case instances. In this paper, we design the first asymptotic instance-optimal algorithm for general interactive decision making problems with a finite number of decisions under mild conditions. On every instance f, our algorithm outperforms all consistent algorithms (those achieving non-trivial regrets on all instances), and has asymptotic regret C(f)ln(n), where C(f) is an exact characterization of the complexity of f. The key step of the algorithm involves hypothesis testing with active data collection. It computes the most economical decisions with which the algorithm collects observations to test whether an estimated instance is indeed correct; thus, the complexity C(f) is the minimum cost to test the instance f against other instances. Our results, instantiated on concrete problems, recover the classical gap-dependent bounds for multi-armed bandits [Lai and Robbins, 1985] and prior works on linear bandits [Lattimore and Szepesvari, 2017], and improve upon the previous best instance-dependent upper bound [Xu et al., 2021] for reinforcement learning.
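
As a concrete special case, for multi-armed bandits the instance-dependent constant specializes (roughly) to the classical Lai-Robbins characterization: any consistent algorithm must satisfy

\[
\liminf_{n \to \infty} \frac{\mathbb{E}[\mathrm{Regret}(n)]}{\ln n} \;\ge\; \sum_{a:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}(\nu_a, \nu_{a^*})},
\]

where \(\Delta_a\) is the suboptimality gap of arm \(a\) and \(\mathrm{KL}(\nu_a, \nu_{a^*})\) is the divergence between the reward distribution of arm \(a\) and that of the optimal arm; instance-optimal algorithms attain this bound asymptotically.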

Speaker Bio: Kefan is a third-year Ph.D. student at Stanford University, advised by Prof. Tengyu Ma. Prior to that, he received his bachelor's degree from the Yao Class at Tsinghua University. He is broadly interested in reinforcement learning theory, online learning, and algorithm design.

Tuesday 20th June 2023, 5PM UTC

Speaker: Han Zhong (Peking University)

Title: GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond

Paper: https://arxiv.org/abs/2211.01962 

Slides: here

More details:

GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond

Authors: Han Zhong, Wei Xiong, Sirui Zheng, Liwei Wang, Zhaoran Wang, Zhuoran Yang, Tong Zhang

Abstract: We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making, which includes Markov decision process (MDP), partially observable Markov decision process (POMDP), and predictive state representation (PSR) as special cases. Toward finding the minimum assumption that empowers sample efficient learning, we propose a novel complexity measure, the generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation in online interactive decision making. Specifically, GEC captures the hardness of exploration by comparing the error of predicting the performance of the updated policy with the in-sample training error evaluated on the historical data. We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear classes, low witness rank problems, PO-bilinear classes, and generalized regular PSRs, where the generalized regular PSR, a new tractable PSR class identified by us, includes nearly all known tractable POMDPs and PSRs. Furthermore, in terms of algorithm design, we propose a generic posterior sampling algorithm, which can be implemented in both model-free and model-based fashion, under both fully observable and partially observable settings. The proposed algorithm modifies the standard posterior sampling algorithm in two aspects: (i) we use an optimistic prior distribution that biases towards hypotheses with higher values and (ii) the log-likelihood function is set to be the empirical loss evaluated on the historical data, where the choice of loss function supports both model-free and model-based learning. We prove that the proposed algorithm is sample efficient by establishing a sublinear regret upper bound in terms of GEC. In summary, we provide a new and unified understanding of both fully observable and partially observable RL.
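
A schematic sketch of the modified posterior sampling step described above; the temperatures gamma and eta and the hypothesis interface are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

# Schematic sketch: an optimistic prior that upweights high-value hypotheses,
# combined with a likelihood term given by the empirical loss on the
# historical data.

def sample_hypothesis(hypotheses, data, gamma=1.0, eta=1.0):
    # Unnormalized log-posterior: optimism bonus minus empirical loss.
    log_w = np.array([gamma * h.optimal_value() - eta * h.empirical_loss(data)
                      for h in hypotheses])
    w = np.exp(log_w - log_w.max())          # stabilize before normalizing
    probs = w / w.sum()
    return hypotheses[np.random.choice(len(hypotheses), p=probs)]
```

The sampled hypothesis is then used to act (its greedy policy in the model-free case, or its optimal policy in the model-based case), and the new data is appended to the history.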

Speaker Bio: Han is a second-year Ph.D. student at Peking University. Before that, he completed his undergraduate studies in Mathematics at the University of Science and Technology of China. Han's research interests lie in the area of machine learning, with a particular focus on reinforcement learning theory.

Tuesday 13th June 2023, 5PM UTC

Speaker: Zeyu Jia (MIT)

Title: Linear Reinforcement Learning with Ball Structure Action Space

Paper: https://arxiv.org/abs/2211.07419

Slides: here

More details:

Linear Reinforcement Learning with Ball Structure Action Space

Authors: Zeyu Jia, Randy Jia, Dhruv Madeka, Dean P. Foster

Abstract: We study the problem of Reinforcement Learning (RL) with linear function approximation, i.e., assuming the optimal action-value function is linear in a known d-dimensional feature mapping. Unfortunately, based on this assumption alone, the worst-case sample complexity has been shown to be exponential, even under a generative model. Instead of making further assumptions on the MDP or value functions, we assume that our action space is such that there always exist playable actions to explore any direction of the feature space. We formalize this assumption as a "ball structure" action space, and show that being able to freely explore the feature space allows for efficient RL. In particular, we propose a sample-efficient RL algorithm (BallRL) that learns an ϵ-optimal policy using only Õ(H^5d^3/ϵ^3) trajectories.
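
Roughly, the ball-structure assumption (our paraphrase, not the paper's exact definition) asks that at every state the set of playable feature vectors contains a ball of fixed radius,

\[
\{\phi(s, a) : a \in \mathcal{A}\} \;\supseteq\; \{x \in \mathbb{R}^d : \|x - c_s\| \le r\} \quad \text{for every state } s,
\]

so that from any state the learner can push the features in any direction and thereby explore the whole feature space.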

Speaker Bio: Zeyu Jia is a third-year PhD student in the Department of EECS at MIT, advised by Alexander Rakhlin and Yury Polyanskiy. He is also affiliated with the Laboratory for Information and Decision Systems (LIDS) at MIT. In summer 2022, he was a research intern at Amazon. Prior to joining MIT, he received his bachelor's degree from the School of Mathematical Sciences at Peking University in 2020. His research interests include machine learning theory, especially reinforcement learning theory, as well as optimization and information theory.

Tuesday 6th June 2023, 5PM UTC

Speaker: Yujia Jin (Stanford University)

Title: VOQL: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation

Paper: https://arxiv.org/abs/2212.06069

Slides: here

More details:

VOQL: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation

Authors: Alekh Agarwal, Yujia Jin, Tong Zhang

Abstract: We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic Q-Learning (VOQL), based on Q-learning and bound its regret assuming completeness and bounded Eluder dimension for the regression function class. As a special case, VOQL achieves Õ(d√(HT) + d^6H^5) regret over T episodes for a horizon-H MDP under (d-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.
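
To illustrate the variance-weighted regression idea in the linear special case, here is a minimal sketch; the variable names, the regularizer lam, and the bonus coefficient beta are our illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

# Minimal sketch of variance-weighted ridge regression with an elliptical
# optimism bonus, as used to build upper bounds on the value function.

def weighted_ridge(features, targets, variances, lam=1.0):
    # features: (n, d) array; targets, variances: (n,) arrays.
    d = features.shape[1]
    w = 1.0 / np.maximum(variances, 1e-8)                 # inverse-variance weights
    A = features.T @ (w[:, None] * features) + lam * np.eye(d)
    theta = np.linalg.solve(A, features.T @ (w * targets))
    return theta, A

def optimistic_q(phi, theta, A, beta):
    # Upper confidence bound: point estimate plus bonus beta * ||phi||_{A^{-1}}.
    return phi @ theta + beta * np.sqrt(phi @ np.linalg.solve(A, phi))
```

Downweighting high-variance samples is what sharpens the leading term of the regret relative to unweighted least squares.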

Speaker Bio: Yujia Jin is a final-year Ph.D. candidate at Stanford, working with Aaron Sidford. She interned with the learning theory team at Google Research in summer 2022, working with Tong Zhang and Alekh Agarwal. Her research interests include designing efficient sublinear- or linear-time algorithms with applications to large-scale and foundational tasks in machine learning and reinforcement learning.

Tuesday 30th May 2023, 5PM UTC

Speaker: Ying Jin (Stanford University)

Title: Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality

Paper: https://arxiv.org/abs/2212.09900

Slides: here

More details:

Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality

Authors: Ying Jin, Zhimei Ren, Zhuoran Yang, Zhaoran Wang

Abstract: This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn the optimal individualized decision rule in a given class. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics are lower bounded in the offline dataset. In other words, the performance of these methods depends on the worst-case propensity in the offline dataset. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities.

In this paper, we propose a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed by quantifying the estimation uncertainty of the augmented inverse propensity weighted (AIPW)-type estimators using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which depends only on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized concentration inequality for IPW estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data.
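
A schematic sketch of the pessimistic selection rule described above; the AIPW score below, the constant c, and the standard-error-style uncertainty are simplified illustrations, not the paper's exact construction:

```python
import numpy as np

# Schematic sketch: doubly-robust (AIPW) scores built from the known behavior
# propensities, a lower confidence bound per policy, and the LCB-maximizing
# policy in the class.

def aipw_score(x, a, r, p, mu_hat, target_action):
    # Estimate of the reward target_action would earn at context x, using the
    # logged action a, observed reward r, and behavior propensity p.
    score = mu_hat(x, target_action)
    if a == target_action:
        score += (r - mu_hat(x, a)) / p
    return score

def pessimistic_policy(policy_class, data, mu_hat, c=1.0):
    best_policy, best_lcb = None, -np.inf
    for policy in policy_class:
        scores = np.array([aipw_score(x, a, r, p, mu_hat, policy(x))
                           for (x, a, r, p) in data])
        # Lower confidence bound: mean value minus a multiple of the standard error.
        lcb = scores.mean() - c * scores.std(ddof=1) / np.sqrt(len(scores))
        if lcb > best_lcb:
            best_policy, best_lcb = policy, lcb
    return best_policy
```

Because the uncertainty term only penalizes policies whose chosen actions are poorly covered, the selected policy needs overlap only where the optimal policy acts, which is the sense in which uniform overlap is avoided.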

Speaker Bio: Ying is a fourth-year PhD student in the Department of Statistics at Stanford University, advised by Emmanuel Candès and Dominik Rothenhäusler. Prior to Stanford, she obtained her BS in Mathematics from Tsinghua University. Ying's current research centers around distribution-free inference, multiple hypothesis testing, causal inference, and data-driven decision making.