Multi-agent systems are becoming increasingly prevalent in machine learning domains such as online bidding, power-grid management, inventory control, and language models. How can we design efficient learning algorithms? Can we make use of multi-agent training to improve single-agent performance? How can we design scalable algorithms when the number of agents is massive? These are a few of the questions we aim to explore in this direction.
It has become common wisdom that adding tasks or training on multiple datasets can improve the performance of a learning algorithm. We introduce the Aligned Multi-Objective Optimization (AMOO) framework, which enables us to investigate this phenomenon from an optimization perspective. Specifically, we design a simple adaptation of gradient descent that is guaranteed to perform better in the presence of multi-objective feedback.
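We do not reproduce the AMOO update here; the following is a generic sketch, assuming a two-objective setting, of a gradient step along a common descent direction (an MGDA-style min-norm combination of the gradients). It is meant only to illustrate the kind of adaptation of gradient descent this line of work considers; the function names and toy objectives are ours.

```python
# A minimal sketch of a multi-objective gradient step that moves along a common
# descent direction for two objectives (an MGDA-style min-norm combination).
# This is only an illustration of adapting gradient descent to multi-objective
# feedback; it is NOT the AMOO update from the paper.
import numpy as np

def common_descent_step(x, grad_fns, lr=0.1):
    g1, g2 = (g(x) for g in grad_fns)          # gradients of the two objectives at x
    diff = g1 - g2
    denom = float(diff @ diff)
    # Min-norm convex combination gamma*g1 + (1-gamma)*g2 (closed form for two objectives).
    gamma = 0.5 if denom == 0 else float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    d = gamma * g1 + (1.0 - gamma) * g2
    return x - lr * d

# Toy example: two aligned quadratics sharing the minimizer x* = 0.
f1_grad = lambda x: 2.0 * x                    # gradient of ||x||^2
f2_grad = lambda x: 8.0 * x                    # gradient of 4 * ||x||^2
x = np.ones(3)
for _ in range(100):
    x = common_descent_step(x, (f1_grad, f2_grad))
print(np.linalg.norm(x))                       # close to 0: both objectives decreased
```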
The state space of many RL and control problems contains information on an uncontrollable and time-correlated process, which we refer to as exogenous distractors. For example, the visual sensors of an autonomous car may detect irrelevant information about a flock of birds. Nevertheless, we would hope that the complexity of learning to drive is independent of the complexity of learning the movement of the flock. In this line of work, we aim to study algorithms for this setting that have no, or only mild, dependence on the complexity of the exogenous noise process.
We focus on a natural class of POMDPs in which there exists a hidden context that is not directly observed as part of the state space; we refer to these as latent MDPs (LMDPs). We study several aspects of this problem: (i) we establish a statistical separation between the scenario in which the hidden context is revealed at the end of each episode and the scenario in which it is not, (ii) we identify several assumptions under which this class of problems is learnable with polynomial sample complexity, and (iii) we study a sub-class of LMDPs for which we design a provably efficient algorithm without additional assumptions.
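To fix ideas (a sketch in our own notation rather than the papers' exact definitions), an LMDP can be formalized as a finite mixture of MDPs that share state and action spaces,

$$
\mathcal{M} = \big\{ (M_m, w_m) \big\}_{m=1}^{K}, \qquad M_m = (\mathcal{S}, \mathcal{A}, P_m, r_m, H), \qquad \sum_{m=1}^{K} w_m = 1,
$$

where at the start of each episode a latent index $m \sim w$ is drawn and the agent interacts with $M_m$ for $H$ steps without observing $m$.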
Motivated by this question, we study linear bandits supplied with offline data in which the feature vectors are only partially observed. We suggest a simple approach, based on geometric intuition, that utilizes the offline data to improve the performance of the online algorithm.
A recurring theme in applying sequential decision-making algorithms is the lack of feedback. Indeed, in most applications in which a human supplies the reward information, there is some limitation on the feedback. Motivated by this realization, we formulated the sequential budgeted learning setting, in which there is a limit on the available reward feedback. We studied a simple, principled approach -- which we refer to as confidence-budget matching -- for this setting and applied it to RL and contextual bandit problems. Towards a deeper understanding of the setting, we also studied problem-dependent guarantees for the multi-armed bandit problem in the presence of feedback constraints.
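To give a feel for the setting (this is not the confidence-budget matching algorithm itself), here is a hypothetical sketch of a UCB-style bandit loop in which reward feedback can be observed at most B times over a horizon of T rounds; the function name, the Bernoulli environment, and the querying rule are all illustrative assumptions of ours.

```python
# Hypothetical sketch of a budget-constrained multi-armed bandit loop.
# This only illustrates the setting -- reward feedback may be observed at most B
# times -- and is NOT the confidence-budget matching algorithm from the paper.
import numpy as np

def budgeted_ucb(means, T, B, rng=np.random.default_rng(0)):
    K = len(means)
    counts = np.zeros(K)          # number of observed rewards per arm
    sums = np.zeros(K)            # sum of observed rewards per arm
    budget = B
    total_reward = 0.0
    for t in range(T):
        # Optimistic index; arms with no observations get +inf to force exploration.
        with np.errstate(divide="ignore", invalid="ignore"):
            bonus = np.sqrt(2.0 * np.log(max(t, 1)) / counts)
        ucb = np.where(counts > 0, sums / np.maximum(counts, 1) + bonus, np.inf)
        arm = int(np.argmax(ucb))
        reward = rng.binomial(1, means[arm])   # environment feedback (Bernoulli arms)
        total_reward += reward
        if budget > 0:                         # feedback is revealed only while budget remains
            counts[arm] += 1
            sums[arm] += reward
            budget -= 1
    return total_reward

# Example: 3 arms, horizon 10000, only 500 reward observations allowed.
print(budgeted_ucb(means=[0.2, 0.5, 0.7], T=10_000, B=500))
```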
Motivated by the recent success of policy optimization in RL and safe RL, we established convergence guarantees for policy optimization techniques. Specifically, we analyzed the convergence of the TRPO algorithm (Schulman et al. 2015) and showed that it enjoys performance guarantees similar to those of the Conservative Policy Iteration algorithm (Kakade and Langford 2002). Additionally, we considered the exploration problem and derived performance guarantees for mirror-descent-type algorithms for MDPs and constrained MDPs.
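A representative mirror-descent policy update of the kind analyzed in this line of work takes the form below (a sketch in our notation, with step size $\eta$ and state-action values $Q^{\pi_k}$; the exact algorithms and rates are in the papers):

$$
\pi_{k+1}(\cdot \mid s) \in \arg\max_{\pi \in \Delta(\mathcal{A})} \Big\{ \eta \, \big\langle Q^{\pi_k}(s, \cdot), \pi \big\rangle - \mathrm{KL}\big(\pi \,\|\, \pi_k(\cdot \mid s)\big) \Big\},
$$

whose closed-form solution is $\pi_{k+1}(a \mid s) \propto \pi_k(a \mid s) \exp\big(\eta\, Q^{\pi_k}(s, a)\big)$.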
Lookahead policies have been a widely used tool in practical applications for several decades -- from Samuel's checkers-playing program to AlphaGo. Motivated by these successes, we studied lookahead policies from several perspectives and addressed when, why, and how they should be used. Furthermore, we established a close relation between lookahead policies and reward shaping, which allowed us to scale multi-step greedy algorithms to deep RL.
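As a concrete illustration (a minimal sketch with our own notation and implementation choices, not the algorithms from the papers), an h-step lookahead policy on a known tabular MDP acts greedily with respect to an h-step Bellman lookahead on top of a terminal value estimate:

```python
# Minimal sketch of an h-step lookahead (multi-step greedy) policy for a known
# tabular MDP with transition tensor P[s, a, s'] and reward matrix R[s, a].
# The implementation choices here are ours and are only meant to illustrate the idea.
import numpy as np

def h_step_lookahead_action(s, P, R, V, h, gamma=0.99):
    """Return the greedy action at state s with respect to an h-step lookahead
    on top of a terminal value estimate V (e.g., a learned value function)."""
    n_states, _ = R.shape

    def lookahead_value(state, depth):
        if depth == 0:
            return V[state]
        q = R[state] + gamma * P[state] @ np.array(
            [lookahead_value(sp, depth - 1) for sp in range(n_states)]
        )
        return q.max()

    q_root = R[s] + gamma * P[s] @ np.array(
        [lookahead_value(sp, h - 1) for sp in range(n_states)]
    )
    return int(np.argmax(q_root))

# Toy usage on a random 4-state, 2-action MDP with a zero terminal value estimate.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))   # P[s, a] is a distribution over next states
R = rng.random((4, 2))
print(h_step_lookahead_action(s=0, P=P, R=R, V=np.zeros(4), h=2))
```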