Publications

Preprints

Sutanoy Dasgupta, Yabo Niu, Kishan Panaganti, Dileep Kalathil, Debdeep Pati, Bani Mallick, "Off-Policy Evaluation Using Information Borrowing and Context-Based Switching". Under review, December 2021. [Supplementary] [Code]

Gist of this work: We consider the off-policy evaluation (OPE) problem in contextual bandits, where the goal is to estimate the value of a target policy using the data collected by a logging policy. Most popular approaches to OPE are variants of the doubly robust (DR) estimator obtained by combining a direct method (DM) estimator and a correction term involving the inverse propensity score (IPS). Existing algorithms primarily focus on strategies to reduce the variance of the DR estimator arising from large IPS values. We propose a new approach called the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator that focuses on reducing both bias and variance. The DR-IC estimator replaces the standard DM estimator with a parametric reward model that borrows information from 'closer' contexts through a correlation structure that depends on the IPS. The DR-IC estimator also adaptively interpolates between this modified DM estimator and a modified DR estimator based on a context-specific switching rule. We give provable guarantees on the performance of the DR-IC estimator. We also demonstrate its superior performance compared to state-of-the-art OPE algorithms on a number of benchmark problems.
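
For a concrete picture of the doubly robust construction described above, the following is a minimal sketch of a vanilla DR estimate with a purely illustrative context-wise switch back to the DM term when the importance weight is large; the reward model `q_hat` and the threshold `tau` are hypothetical placeholders, and this is not the DR-IC estimator itself.

```python
import numpy as np

def dr_estimate(data, num_actions, pi_log, pi_target, q_hat, tau=None):
    """data: iterable of (context, action, reward) triples logged by pi_log.
    pi_log(x, a), pi_target(x, a): action probabilities under the two policies.
    q_hat(x, a): any plug-in reward model (the direct-method component).
    tau: optional switching threshold on the importance weight (illustrative)."""
    values = []
    for x, a, r in data:
        # Direct-method term: model-based value of the target policy at context x.
        dm = sum(pi_target(x, b) * q_hat(x, b) for b in range(num_actions))
        # Inverse-propensity importance weight.
        w = pi_target(x, a) / pi_log(x, a)
        # Doubly robust correction of the direct-method term.
        dr = dm + w * (r - q_hat(x, a))
        # Context-based switch: fall back to the DM term when the weight is large.
        values.append(dm if (tau is not None and w > tau) else dr)
    return float(np.mean(values))
```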

Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh, "Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage". Under review, October 2023. [Supplementary] [Code]

Gist of this work: The goal of an offline reinforcement learning (RL) algorithm is to learn optimal policies using historical (offline) data, without access to the environment for online exploration. One of the main challenges in offline RL is distribution shift, which refers to the difference between the state-action visitation distribution of the data-generating policy and that of the learning policy. Many recent works have used the idea of pessimism for developing offline RL algorithms and characterizing their sample complexity under a relatively weak assumption of single policy concentrability. Different from the offline RL literature, the area of distributionally robust learning (DRL) offers a principled framework that uses a minimax formulation to tackle model mismatch between training and testing environments. In this work, we aim to bridge these two areas by showing that the DRL approach can be used to tackle the distribution shift problem in offline RL. In particular, we propose two offline RL algorithms using the DRL framework, for the tabular and linear function approximation settings, and characterize their sample complexity under the single policy concentrability assumption. We also demonstrate the superior performance of our proposed algorithms through simulation experiments.
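
As a rough illustration of the tabular idea, the sketch below builds an empirical model from offline transition counts and runs value iteration with a distributionally robust backup over a total-variation ball around the empirical next-state distribution; this is a simplified stand-in under assumed inputs (`counts`, `rewards`, radius `rho`), not the paper's algorithm or its pessimism analysis.

```python
import numpy as np

def tv_worst_case(p, v, rho):
    """Exact inner minimization min_q q @ v over {q : TV(q, p) <= rho}:
    move up to rho probability mass from the highest-value states to the
    lowest-value state."""
    q = p.copy()
    budget = rho
    sink = np.argmin(v)
    for s in np.argsort(v)[::-1]:          # highest-value states first
        if s == sink or budget <= 0:
            continue
        move = min(q[s], budget)
        q[s] -= move
        q[sink] += move
        budget -= move
    return q @ v

def robust_value_iteration(counts, rewards, rho, gamma=0.99, iters=200):
    """counts[s, a, s']: transition counts from the offline dataset.
    rewards[s, a]: empirical mean rewards. Returns a robust Q-table."""
    S, A, _ = counts.shape
    p_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)
    q = np.zeros((S, A))
    for _ in range(iters):
        v = q.max(axis=1)
        for s in range(S):
            for a in range(A):
                q[s, a] = rewards[s, a] + gamma * tv_worst_case(p_hat[s, a], v, rho)
    return q
```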

Conference Proceedings

Kishan Panaganti*, Zaiyan Xu*, Dileep Kalathil, Mohammad Ghavamzadeh, "Distributionally Robust Behavioral Cloning for Robust Imitation Learning". Accepted to IEEE Conference on Decision and Control (CDC), December 2023. [Supplementary]

Gist of this work: Robust reinforcement learning (RL) aims to learn a policy that can withstand uncertainties in model parameters, which often arise in practical RL applications due to modeling errors in simulators, variations in real-world system dynamics, and adversarial disturbances. This paper introduces the robust imitation learning (IL) problem in a Markov decision process (MDP) framework, where an agent learns to mimic an expert demonstrator in a way that withstands uncertainties in model parameters, without additional online environment interactions. The agent is only provided with a dataset of state-action pairs from the expert on a single (nominal) dynamics, without any information about the true rewards from the environment. Behavioral cloning (BC), a supervised learning method, is a powerful algorithm to address the vanilla IL problem. We propose an algorithm for the robust IL problem that combines distributionally robust optimization (DRO) with BC. We call the algorithm DR-BC and show its robust performance against parameter uncertainties both in theory and in practice. We also demonstrate the empirical performance of our approach in addressing model perturbations on several MuJoCo continuous control tasks.
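
To illustrate how DRO can be layered on top of behavioral cloning, here is a hedged sketch that replaces the average BC loss with a worst-case loss over a KL ball around the empirical data distribution, computed via the standard dual form $\sup_{KL(Q\|P)\le\rho} E_Q[\ell] = \inf_{\lambda>0} \{\lambda \log E_P[e^{\ell/\lambda}] + \lambda\rho\}$; the radius `rho`, the grid over $\lambda$, and the usage note are illustrative choices, not the DR-BC algorithm as specified in the paper.

```python
import torch

def dro_bc_loss(per_sample_losses, rho=0.1, lam_grid=None):
    """Worst-case expected BC loss over a KL ball of radius rho around the
    empirical distribution, via the dual
        sup_{KL(Q||P) <= rho} E_Q[loss] = inf_{lam > 0} lam * log E_P[exp(loss/lam)] + lam * rho.
    The infimum over lam is approximated on a coarse grid for simplicity."""
    losses = per_sample_losses.flatten()
    n = losses.numel()
    if lam_grid is None:
        lam_grid = torch.logspace(-2, 2, 25)
    values = []
    for lam in lam_grid:
        # log of the empirical mean of exp(loss / lam), computed stably.
        log_mean_exp = torch.logsumexp(losses / lam, dim=0) - torch.log(torch.tensor(float(n)))
        values.append(lam * log_mean_exp + lam * rho)
    return torch.stack(values).min()

# Illustrative usage (assumed policy API): per_sample_losses could be
# -policy.log_prob(expert_actions) on a minibatch, and
# dro_bc_loss(per_sample_losses).backward() replaces the mean BC loss.
```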

Jessica Maghakian, Paul Mineiro, Kishan Panaganti*, Mark Rucker, Akanksha Saran, Cheng Tan, "Personalized Reward Learning with Interaction-Grounded Learning (IGL)". Accepted (31.8% acceptance rate) to International Conference on Learning Representations, April 2023. [Supplementary]

Gist of this work: In an era of countless content offerings, recommender systems alleviate information overload by providing users with personalized content suggestions. Due to the scarcity of explicit user feedback, modern recommender systems typically optimize for a fixed combination of implicit feedback signals across all users. However, this approach disregards a growing body of work showing that (i) users employ implicit signals in diverse ways, expressing anything from satisfaction to active dislike, and (ii) different users communicate their preferences in different ways. We propose applying the recent Interaction Grounded Learning (IGL) paradigm to address the challenge of learning representations of diverse user communication modalities. Rather than relying on a fixed, human-designed reward function, IGL is able to learn personalized reward functions for different users and then optimize directly for latent user satisfaction. We demonstrate the success of IGL with experiments using simulations as well as real-world production traces.

*Alphabetical order

Zaiyan Xu*, Kishan Panaganti*, Dileep Kalathil, "Improved Sample Complexity Bounds for Distributionally Robust Reinforcement Learning". Accepted (29% acceptance rate) to Artificial Intelligence and Statistics, April 2023. [Supplementary] [Code]

Gist of this work: We consider the problem of learning a control policy that is robust against parameter mismatches between the training and testing environments. We formulate this as a distributionally robust reinforcement learning (DR-RL) problem where the objective is to learn the policy which maximizes the value function against the worst possible stochastic model of the environment in an uncertainty set. We focus on the tabular episodic learning setting where the algorithm has access to a generative model of the nominal (training) environment around which the uncertainty set is defined. We propose the Robust Phased Value Learning (RPVL) algorithm to solve this problem for uncertainty sets specified by four different divergences: total variation, chi-square, Kullback-Leibler, and Wasserstein. We show that our algorithm achieves $O(|S||A|H^{5})$ sample complexity, which is uniformly better than the existing results by a factor of $|S|$, where $|S|$ is the number of states, $|A|$ is the number of actions, and $H$ is the horizon length. We also provide the first-ever sample complexity result for the Wasserstein uncertainty set. Finally, we demonstrate the performance of our algorithm using simulation experiments.

*Equal contributions
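
As one concrete example of how the inner worst-case expectation is computed for a specific divergence, the KL-constrained case admits the well-known dual $\inf_{q: KL(q\|p)\le\rho} \sum_s q(s)v(s) = \sup_{\lambda>0} \{-\lambda \log \sum_s p(s) e^{-v(s)/\lambda} - \lambda\rho\}$. The sketch below evaluates this dual on a coarse grid over $\lambda$ and is meant only as an illustration of one robust backup, not as the RPVL algorithm itself.

```python
import numpy as np

def kl_robust_expectation(p, v, rho, lam_grid=None):
    """Dual evaluation of inf_{q : KL(q||p) <= rho} q @ v:
        sup_{lam > 0} -lam * log( sum_s p[s] * exp(-v[s]/lam) ) - lam * rho.
    The supremum over lam is approximated on a coarse grid (illustrative)."""
    if lam_grid is None:
        lam_grid = np.logspace(-3, 3, 61)
    best = -np.inf
    for lam in lam_grid:
        # Stable log-sum-exp weighted by the nominal probabilities p.
        z = -v / lam
        m = z.max()
        log_e = m + np.log(np.sum(p * np.exp(z - m)))
        best = max(best, -lam * log_e - lam * rho)
    return best

# Illustrative one-step robust backup: q[s, a] = r[s, a] + gamma *
# kl_robust_expectation(p_hat[s, a], v, rho), with p_hat the nominal model.
```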

Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh, "Robust Reinforcement Learning using Offline Data". Accepted to Neural Information Processing Systems (NeurIPS), December 2022. [Supplementary] [Code]

Gist of this work: The goal of robust reinforcement learning (RL) is to learn a policy that is robust against the uncertainty in model parameters. Parameter uncertainty commonly occurs in many real-world RL applications due to simulator modeling errors, changes in the real-world system dynamics over time, and adversarial disturbances. Robust RL is typically formulated as a max-min problem, where the objective is to learn the policy that maximizes the value against the worst possible models that lie in an uncertainty set. In this work, we propose a robust RL algorithm called Robust Fitted Q-Iteration (RFQI), which uses only an offline dataset to learn the optimal robust policy. Robust RL with offline data is significantly more challenging than its non-robust counterpart because of the minimization over all models present in the robust Bellman operator. This poses challenges in offline data collection, optimization over the models, and unbiased estimation. In this work, we propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm. We prove that RFQI learns a near-optimal robust policy under standard assumptions and demonstrate its superior performance on standard benchmark problems.
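
For orientation only, the following is the standard (non-robust) fitted Q-iteration loop that RFQI builds on, written with a generic off-the-shelf regressor and an assumed tuple format for the offline dataset; the robust target computation via a dual reformulation over the uncertainty set, which is the paper's actual contribution, is deliberately not reproduced here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fitted_q_iteration(dataset, actions, gamma=0.99, iters=50):
    """dataset: list of (state, action, reward, next_state) tuples from the
    offline data, where each state is a 1-D feature vector and each action is
    a number from `actions`. Standard FQI; a robust variant would replace
    `target` below with a worst-case backup."""
    states = np.array([s for s, a, r, s2 in dataset])
    acts = np.array([[a] for s, a, r, s2 in dataset])
    rewards = np.array([r for s, a, r, s2 in dataset])
    next_states = np.array([s2 for s, a, r, s2 in dataset])
    x = np.hstack([states, acts])
    model = None
    for _ in range(iters):
        if model is None:
            target = rewards
        else:
            # Greedy one-step lookahead over the finite action set.
            q_next = np.stack([
                model.predict(np.hstack([next_states,
                                         np.full((len(dataset), 1), a)]))
                for a in actions], axis=1)
            target = rewards + gamma * q_next.max(axis=1)
        model = GradientBoostingRegressor().fit(x, target)
    return model
```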

Kishan Panaganti, Dileep Kalathil, "Sample Complexity of Robust Reinforcement Learning with a Generative Model". Accepted to Artificial Intelligence and Statistics, March 2022. [Paper] [Supplementary] [Code]

Gist of this work: The Robust Markov Decision Process (RMDP) framework focuses on designing control policies that are robust against the parameter uncertainties due to the mismatches between the simulator model and real-world settings. An RMDP problem is typically formulated as a max-min problem, where the objective is to find the policy that maximizes the value function for the worst possible model that lies in an uncertainty set around a nominal model. The standard robust dynamic programming approach requires the knowledge of the nominal model for computing the optimal robust policy. In this work, we propose a model-based reinforcement learning (RL) algorithm for learning an $\epsilon$-optimal robust policy when the nominal model is unknown. We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence. For each of these uncertainty sets, we give a precise characterization of the sample complexity of our proposed algorithm. In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies. Finally, we demonstrate the performance of our algorithm on two benchmark problems.
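
For reference, assuming the standard $(s,a)$-rectangular formulation, the max-min objective and the robust Bellman equation underlying this setup can be written as

$$\max_{\pi} \min_{P \in \mathcal{P}} V^{\pi}_{P}, \qquad V^{*}(s) = \max_{a} \Big\{ r(s,a) + \gamma \inf_{P_{s,a} \in \mathcal{P}_{s,a}} \sum_{s'} P_{s,a}(s')\, V^{*}(s') \Big\},$$

where each $\mathcal{P}_{s,a}$ is, for example, a total variation, chi-square, or KL ball of radius $\rho$ around the nominal model $\widehat{P}_{s,a}$.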

Kishan Panaganti, Dileep Kalathil, "Sample Complexity of Model-Based Robust Reinforcement Learning". Accepted to IEEE Conference on Decision and Control (CDC), December 2021. [Paper] [Supplementary] [Code]

Gist of this work: The Robust Markov Decision Process (RMDP) framework focuses on designing control policies that are robust against the parameter uncertainties due to the mismatches between the simulator model and real-world settings. An RMDP problem is typically formulated as a max-min problem, where the objective is to find the policy that maximizes the value function for the worst possible model that lies in an uncertainty set around a nominal model. The standard robust dynamic programming approach requires the knowledge of the nominal model for computing the optimal robust policy. In this work, we propose a model-based reinforcement learning (RL) algorithm for learning an $\epsilon$-optimal robust policy when the nominal model is unknown. We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence. For each of these uncertainty sets, we give a precise characterization of the sample complexity of our proposed algorithm. In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies. Finally, we demonstrate the performance of our algorithm on two benchmark problems.

Kishan Panaganti, Dileep Kalathil, "Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees". Accepted to the Thirty-eighth International Conference on Machine Learning (ICML), July 2021. [Paper] [Supplementary]

Gist of this work: We address the problem of model-free reinforcement learning for Robust Markov Decision Processes (RMDPs) with large state spaces. The goal of the RMDP framework is to find a policy that is robust against parameter uncertainties due to the mismatch between the simulator model and real-world settings. We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation. We prove the convergence of this algorithm using stochastic approximation techniques. We then propose the Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy. We also give a general weighted Euclidean norm bound on the error (closeness to optimality) of the resulting policy. Finally, we demonstrate the performance of our RLSPI algorithm on some standard benchmark problems.
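
As a point of reference for the least-squares machinery, here is a minimal LSTDQ-style policy-evaluation sketch with linear features and an assumed transition-tuple format; it is only the non-robust baseline, and the robust evaluation step in the paper, which modifies the backup using the uncertainty set, is not reproduced here.

```python
import numpy as np

def lstd_policy_evaluation(transitions, phi, gamma=0.99, reg=1e-3):
    """Standard LSTDQ for evaluating the policy that generated `transitions`
    (a list of (state, action, reward, next_state, next_action) tuples).
    phi(s, a) returns a feature vector; returns weights w with
    Q(s, a) ~= phi(s, a) @ w. A robust version would replace the next-state
    term with a worst-case backup over the uncertainty set."""
    d = len(np.asarray(phi(*transitions[0][:2])))
    A = reg * np.eye(d)      # regularized feature cross-covariance matrix
    b = np.zeros(d)
    for s, a, r, s2, a2 in transitions:
        f = np.asarray(phi(s, a))
        f_next = np.asarray(phi(s2, a2))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```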

Kishan Panaganti, Dileep Kalathil, "Bounded Regret for Finitely Parameterized Multi-Armed Bandits". Accepted to IEEE Conference on Decision and Control (CDC), December 2020. [Paper] [Supplementary]

Gist of this work: We developed an algorithm that can exploit the structure of the underlying system to achieve optimal and data-efficient performance. We formulated this problem as a finitely parameterized multi-armed bandit problem, where the model of the underlying stochastic environment can be characterized by a common unknown parameter. The true parameter is unknown to the learning agent; however, the set of possible parameters, which is finite, is known a priori. We proposed a simple, easy-to-implement algorithm, which we call the FP-UCB algorithm, that uses the information about the underlying parameter set for faster learning. We characterized the performance of this algorithm using the metric of regret, a notion that quantifies how far the learning algorithm is from the optimal decision at each time step of the learning process. In particular, we show that the FP-UCB algorithm achieves a bounded regret under some structural condition on the underlying parameter set. We also show that, if the underlying parameter set does not satisfy this structural condition, the FP-UCB algorithm achieves a logarithmic regret, but with a smaller leading constant compared to standard algorithms. We also validated the superior performance of the FP-UCB algorithm through extensive experiments.
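
The sketch below conveys the flavor of exploiting a known finite parameter set: each candidate parameter predicts the mean reward of every arm, candidates inconsistent with the empirical confidence intervals are discarded, and the algorithm acts optimistically with respect to the survivors. The confidence radius, the elimination rule, and the Gaussian reward simulation are illustrative choices, not the exact FP-UCB construction from the paper.

```python
import numpy as np

def finite_parameter_bandit(arm_means_by_theta, true_theta, horizon, seed=0):
    """Illustrative bandit loop exploiting a known finite parameter set.
    arm_means_by_theta: dict mapping each candidate parameter to an array of
    per-arm mean rewards; rewards are simulated as unit-variance Gaussians
    around the true parameter's means (assumed setup)."""
    arm_means_by_theta = {th: np.asarray(m) for th, m in arm_means_by_theta.items()}
    rng = np.random.default_rng(seed)
    thetas = list(arm_means_by_theta)
    K = len(arm_means_by_theta[thetas[0]])
    counts, sums = np.zeros(K), np.zeros(K)
    active = set(thetas)
    for t in range(1, horizon + 1):
        seen = counts > 0
        means = np.divide(sums, counts, out=np.zeros(K), where=seen)
        radius = np.sqrt(2.0 * np.log(t + 1) / np.maximum(counts, 1))
        # Keep only parameters consistent with every sampled arm's confidence interval.
        active = {th for th in active
                  if np.all(np.abs(arm_means_by_theta[th][seen] - means[seen])
                            <= radius[seen])} or set(thetas)
        # Act optimistically with respect to the surviving parameters.
        arm = int(max(range(K),
                      key=lambda k: max(arm_means_by_theta[th][k] for th in active)))
        reward = rng.normal(arm_means_by_theta[true_theta][arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts
```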

Workshops

Jessica Maghakian, Kishan Panaganti, Paul Mineiro, Akanksha Saran, Cheng Tan, "Interaction-Grounded Learning for Recommendation Systems". Appeared in the Online Recommender Systems and User Modeling Workshop at ACM RecSys 2022. [Paper]

Gist of this work: Recommender systems have long grappled with optimizing user satisfaction using only implicit user feedback. Many approaches in the literature rely on complicated feedback modeling and costly user studies. We propose online recommender systems as a candidate for the recently introduced Interaction Grounded Learning (IGL) paradigm. In IGL, a learner attempts to optimize a latent reward in an environment by observing feedback with no grounding. We introduce a novel personalized variant of IGL for recommender systems that can leverage explicit and implicit user feedback to maximize user satisfaction, with no feedback signal modeling and minimal assumptions. With our empirical evaluations that include simulations as well as experiments on real product data, we demonstrate the effectiveness of IGL for recommender systems.

Kishan Panaganti, Dileep Kalathil, "Model-Free Robust Reinforcement Learning with Linear Function Approximation". Appeared in Challenges of Real-World RL NeurIPS 2020 Workshop. [Preprint]

Gist of this work: We developed an RL algorithm, called the Robust Least Squares Policy Iteration (RLSPI) algorithm, that learns an optimal policy which is robust against parameter mismatches. We addressed this problem using the Robust Markov Decision Process (RMDP) formulation, which considers a set of model parameters (uncertainty set) under the assumption that the actual real-world parameters lie in this uncertainty set. The proposed learning algorithm finds a policy that performs best under the worst model parameter in the uncertainty set, and hence this policy is robust to the parameter mismatch (sim-to-real gap) between the training and testing environments. We focused on problems with very large state spaces using least-squares-based online model-free reinforcement learning with linear function approximation. We also demonstrated the performance of the RLSPI algorithm on several OpenAI Gym environments: CartPole, MountainCar, Acrobot, and FrozenLake.

Journals

Kishan Panaganti, Dileep Kalathil, "Bounded Regret for Finitely Parameterized Multi-Armed Bandits". Published in IEEE Control Systems Letters, July 2020. [Paper] [Supplementary]

Gist of this work: We developed an algorithm that can exploit the structure of the underlying system to achieve optimal and data-efficient performance. We formulated this problem as a finitely parameterized multi-armed bandit problem, where the model of the underlying stochastic environment can be characterized by a common unknown parameter. The true parameter is unknown to the learning agent; however, the set of possible parameters, which is finite, is known a priori. We proposed a simple, easy-to-implement algorithm, which we call the FP-UCB algorithm, that uses the information about the underlying parameter set for faster learning. We characterized the performance of this algorithm using the metric of regret, a notion that quantifies how far the learning algorithm is from the optimal decision at each time step of the learning process. In particular, we show that the FP-UCB algorithm achieves a bounded regret under some structural condition on the underlying parameter set. We also show that, if the underlying parameter set does not satisfy this structural condition, the FP-UCB algorithm achieves a logarithmic regret, but with a smaller leading constant compared to standard algorithms. We also validated the superior performance of the FP-UCB algorithm through extensive experiments.