Selected Publications
Please see Google Scholar or DBLP for a full list of publications.
Note: Click the dropdown on each work below to view its abstract.
Kishan Panaganti, "Robust Reinforcement Learning: Theory and Algorithms". Texas A&M University, 2023. Available electronically from this link.
Abstract: This research dissertation explores novel algorithms in the field of robust reinforcement learning (RL) that address the challenges of controlling dynamical systems in real-world scenarios. Classical reinforcement learning is a powerful sub-field of machine learning for training intelligent sequential decision-making agents in complex environments. However, these algorithms often face challenges when it comes to uncertainties and variations in the environment, as well as the requirement for a large number of training samples. In this work, we present novel robust reinforcement learning algorithms that address these challenges. Our algorithms focus on robustness to uncertainties in the environment that arise through variations in the transition dynamics. By leveraging techniques such as distributionally robust optimization, our algorithms aim to learn policies that can withstand these uncertainties. We study robust reinforcement learning both in the online environment-interaction setting and in the setting where we are given only historical data without access to the environment. We also study the problems of imitation learning and offline reinforcement learning, which are relevant for real-world applications whose goals differ from those of robust reinforcement learning; the tools of distributionally robust optimization and model pessimism nevertheless play crucial roles in improving these learning domains. The experimental results demonstrate the effectiveness of our robust algorithms, showcasing their potential for real-world applications where uncertainties and variations are prevalent.
Kishan Panaganti, "The Statistical Physics of Load Balancing in Network Systems". Indian Institute of Science, Bengaluru, India, 2017. Available at this link.
Abstract: This research dissertation explores the large deviation principles of the belief propagation algorithm for a structured load balancing problem.
Sutanoy Dasgupta, Yabo Niu, Kishan Panaganti, Dileep Kalathil, Debdeep Pati, Bani Mallick, "Off-Policy Evaluation Using Information Borrowing and Context-Based Switching". Under review, December 2021, updated August 2024. [Supplementary] [Code]
Abstract: We consider the off-policy evaluation (OPE) problem in contextual bandits, where the goal is to estimate the value of a target policy using the data collected by a logging policy. Most popular approaches to OPE are variants of the doubly robust (DR) estimator obtained by combining a direct method (DM) estimator and a correction term involving the inverse propensity score (IPS). Existing algorithms primarily focus on strategies to reduce the variance of the DR estimator arising from large IPS. We propose a new approach called the Doubly Robust with Information borrowing and Context-based switching (DR-IC) estimator that focuses on reducing both bias and variance. The DR-IC estimator replaces the standard DM estimator with a parametric reward model that borrows information from the 'closer' contexts through a correlation structure that depends on the IPS. The DR-IC estimator also adaptively interpolates between this modified DM estimator and a modified DR estimator based on a context-specific switching rule. We give provable guarantees on the performance of the DR-IC estimator. We also demonstrate the superior performance of the DR-IC estimator compared to the state-of-the-art OPE algorithms on a number of benchmark problems.
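For background, the standard doubly robust estimator that DR-IC builds on (shown here in its textbook form, not the DR-IC estimator itself) combines a direct-method reward model $\hat r$ with an IPS-weighted correction:

$$
\hat V_{\mathrm{DR}}(\pi) = \frac{1}{n}\sum_{i=1}^{n}\Bigg[\sum_{a}\pi(a\mid x_i)\,\hat r(x_i,a) + \frac{\pi(a_i\mid x_i)}{\mu(a_i\mid x_i)}\big(r_i-\hat r(x_i,a_i)\big)\Bigg],
$$

where $\mu$ is the logging policy. Large ratios $\pi/\mu$ drive the variance that the information-borrowing reward model and the context-based switching rule of DR-IC are designed to control.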
Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran, "Distributionally Robust Direct Preference Optimization". Under review, February 2025. [Supplementary]
Abstract: A major challenge in aligning large language models (LLMs) with human preferences is the issue of distribution shift. LLM alignment algorithms rely on static preference datasets, assuming that they accurately represent real-world user preferences. However, user preferences vary significantly across geographical regions, demographics, linguistic patterns, and evolving cultural trends. This preference distribution shift leads to catastrophic alignment failures in many real-world applications. We address this problem using the principled framework of distributionally robust optimization, and develop two novel distributionally robust direct preference optimization (DPO) algorithms, namely, Wasserstein DPO (WDPO) and Kullback–Leibler DPO (KLDPO). We characterize the sample complexity of learning the optimal policy parameters for WDPO and KLDPO. Moreover, we propose scalable gradient descent-style learning algorithms by developing suitable approximations for the challenging minimax loss functions of WDPO and KLDPO. Our empirical experiments demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift.
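For context, the standard DPO objective that WDPO and KLDPO robustify (the textbook loss, not the robust losses themselves) is

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right].
$$

Roughly speaking, the distributionally robust variants replace the expectation over the fixed preference distribution $\mathcal{D}$ with a worst case over distributions in a Wasserstein or KL ball around it, which is what produces the minimax losses that the paper then approximates for scalable training.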
Kishan Panaganti, Eric Mazumdar, Adam Wierman, "Cooperative Multi-agent Robust Reinforcement Learning". Manuscript under preparation, 2025.
Kishan Panaganti, Dileep Kalathil, "Bounded Regret for Finitely Parameterized Multi-Armed Bandits". Published in IEEE Control Systems Letters, July 2020. [Paper] [Supplementary]
Abstract: We developed an algorithm that can exploit the structure of the underlying system to achieve optimal and data-efficient performance. We formulated this problem as a finitely parameterized multi-armed bandits problem where the model of the underlying stochastic environment can be characterized by a common unknown parameter. The true parameter is unknown to the learning agent; however, the set of possible parameters, which is finite, is known a priori. We proposed a simple, easy-to-implement algorithm, which we call the FP-UCB algorithm, that uses the information about the underlying parameter set for faster learning. We characterized the performance of this algorithm using the metric of regret, a notion that quantifies how far the learning algorithm is from the optimal decision at each time-step of the learning process. In particular, we show that the FP-UCB algorithm achieves a bounded regret under some structural condition on the underlying parameter set. We also show that, if the underlying parameter set does not satisfy this structural condition, the FP-UCB algorithm achieves a logarithmic regret, but with a smaller leading constant compared to the standard RL algorithms. We also validated the superior performance of the FP-UCB algorithm through extensive experiments.
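As a toy illustration of how a finite parameter set can be exploited (a hedged sketch of the setting only, not the paper's FP-UCB index rule; the candidate parameters, elimination threshold, and horizon are assumptions made for this example):

```python
# Illustrative sketch (not the paper's exact FP-UCB rule): a learner for the
# finitely parameterized bandit setting that keeps only the candidate
# parameters still consistent with empirical arm means, then plays the best
# arm under an optimistic surviving parameter. Names and the elimination
# threshold are assumptions made for this example.
import numpy as np

rng = np.random.default_rng(0)

# Finite set of candidate parameters; each row gives the mean reward of every arm.
CANDIDATE_MEANS = np.array([
    [0.2, 0.5, 0.7],   # theta_1
    [0.6, 0.4, 0.3],   # theta_2
    [0.1, 0.8, 0.2],   # theta_3
])
TRUE_THETA = 0         # unknown to the learner
K = CANDIDATE_MEANS.shape[1]

counts = np.zeros(K)
sums = np.zeros(K)

def pull(arm):
    """Bernoulli reward with the true (hidden) mean."""
    return float(rng.random() < CANDIDATE_MEANS[TRUE_THETA, arm])

T = 2000
for t in range(1, T + 1):
    if np.any(counts == 0):                          # play each arm once
        arm = int(np.argmin(counts))
    else:
        means = sums / counts
        radius = np.sqrt(2.0 * np.log(t) / counts)   # confidence radius
        # Keep candidate parameters whose arm means all lie inside the
        # confidence intervals; fall back to all candidates if none survive.
        ok = np.all(np.abs(CANDIDATE_MEANS - means) <= radius, axis=1)
        surviving = CANDIDATE_MEANS[ok] if ok.any() else CANDIDATE_MEANS
        # Optimism over the surviving parameters: play the arm with the
        # highest mean promised by any consistent candidate.
        arm = int(np.argmax(surviving.max(axis=0)))
    r = pull(arm)
    counts[arm] += 1
    sums[arm] += r

print("pull counts per arm:", counts)
```

Once the surviving candidate set collapses onto the true parameter, such a learner stops paying for exploration, which loosely matches the intuition behind the bounded-regret regime described above.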
Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh, "Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage". Accepted to the 7th Annual Learning for Dynamics & Control Conference, June 2025. [Supplementary] [Code]
Abstract: The goal of an offline reinforcement learning (RL) algorithm is to learn optimal policies using historical (offline) data, without access to the environment for online exploration. One of the main challenges in offline RL is distribution shift, which refers to the difference between the state-action visitation distribution of the data-generating policy and that of the learning policy. Many recent works have used the idea of pessimism for developing offline RL algorithms and characterizing their sample complexity under a relatively weak assumption of single-policy concentrability. Different from the offline RL literature, the area of distributionally robust learning (DRL) offers a principled framework that uses a minimax formulation to tackle model mismatch between training and testing environments. In this work, we aim to bridge these two areas by showing that the DRL approach can be used to tackle the distributional shift problem in offline RL. In particular, we propose two offline RL algorithms using the DRL framework, for the tabular and linear function approximation settings, and characterize their sample complexity under the single-policy concentrability assumption. We also demonstrate the superior performance of our proposed algorithms through simulation experiments.
Chengrui Qu, Laixi Shi, Kishan Panaganti, Pengcheng You, Adam Wierman, "Hybrid Transfer Reinforcement Learning: Provable Sample Efficiency from Shifted-Dynamics Data". Accepted as oral (~2% oral acceptance rate) to Artificial Intelligence and Statistics, May 2025. [Paper] [Supplementary]
Abstract: Online reinforcement learning (RL) typically requires high-stakes online interaction data to learn a policy for a target task. This prompts interest in leveraging historical data to improve sample efficiency. The historical data may come from outdated or related source environments with different dynamics. It remains unclear how to effectively use such data in the target task to provably enhance learning and sample efficiency. To address this, we propose a hybrid transfer RL (HTRL) setting, where an agent learns in a target environment while accessing offline data from a source environment with shifted dynamics. We show that -- without information on the dynamics shift -- general shifted-dynamics data, even with subtle shifts, does not reduce sample complexity in the target environment. However, with prior information on the degree of the dynamics shift, we design HySRL, a transfer algorithm that achieves problem-dependent sample complexity and outperforms pure online RL. Finally, our experimental results demonstrate that HySRL surpasses the state-of-the-art online RL baseline.
Eric Mazumdar, Kishan Panaganti, Laixi Shi, "Tractable Equilibrium Computation in Markov Games through Risk Aversion". Accepted as oral (~1.8% of 11500 submitted works) to International Conference on Learning Representations, April 2025. [Paper] [Supplementary]
Abstract: A significant roadblock to the development of principled multi-agent reinforcement learning is the fact that desired solution concepts like Nash equilibria may be intractable to compute. To overcome this obstacle, we take inspiration from behavioral economics and show that—by imbuing agents with important features of human decision-making like risk aversion and bounded rationality—a class of risk-averse quantal response equilibria (RQE) become tractable to compute in all n-player matrix and finite-horizon Markov games. In particular, we show that they emerge as the endpoint of no-regret learning in suitably adjusted versions of the games. Crucially, the class of computationally tractable RQE is independent of the underlying game structure and only depends on agents’ degree of risk-aversion and bounded rationality. To validate the richness of this class of solution concepts we show that it captures peoples’ patterns of play in a number of 2-player matrix games previously studied in experimental economics. Furthermore, we give a first analysis of the sample complexity of computing these equilibria in finite-horizon Markov games when one has access to a generative model and validate our findings on a simple multi-agent reinforcement learning benchmark.
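As a simplified illustration of the quantal-response part of RQE (a minimal sketch only: it iterates plain logit responses in a 2x2 matrix game and omits the risk-aversion component and the no-regret dynamics analyzed in the paper; the payoff matrices, temperature, and iteration count are assumptions):

```python
# Simplified illustration of quantal response in a 2-player matrix game:
# iterate logit (softmax) best responses to a fixed point. This omits the
# risk-aversion component of the paper's RQE and is not its algorithm;
# payoff matrices, temperature, and iteration count are assumptions.
import numpy as np

def softmax(x, tau):
    z = np.exp((x - x.max()) / tau)
    return z / z.sum()

# Row player's payoffs A, column player's payoffs B (a 2x2 example game).
A = np.array([[3.0, 0.0],
              [5.0, 1.0]])
B = np.array([[3.0, 5.0],
              [0.0, 1.0]])

tau = 0.5                       # bounded-rationality temperature
p = np.ones(2) / 2              # row player's mixed strategy
q = np.ones(2) / 2              # column player's mixed strategy

for _ in range(500):
    p_new = softmax(A @ q, tau)       # expected payoff of each row action
    q_new = softmax(B.T @ p, tau)     # expected payoff of each column action
    if np.abs(p_new - p).max() + np.abs(q_new - q).max() < 1e-10:
        p, q = p_new, q_new
        break
    p, q = p_new, q_new

print("row strategy:", p.round(3), "column strategy:", q.round(3))
```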
Zhengfei Zhang, Kishan Panaganti, Laixi Shi, Yanan Sui, Adam Wierman, Yisong Yue, "Distributionally Robust Constrained Reinforcement Learning under Strong Duality". Accepted to the first Reinforcement Learning Conference (RLC), August 2024. [Paper] [Supplementary]
Abstract: We study the problem of Distributionally Robust Constrained RL (DRC-RL), where the goal is to maximize the expected reward subject to environmental distribution shifts and constraints. This setting captures situations where training and testing environments differ, and policies must satisfy constraints motivated by safety or limited budgets. Despite significant progress toward algorithm design for the separate problems of distributionally robust RL and constrained RL, there do not yet exist algorithms with end-to-end convergence guarantees for DRC-RL. We develop an algorithmic framework based on strong duality that enables the first efficient, provable solution under a class of environmental uncertainties. Further, our framework exposes a structural characterization of DRC-RL that arises from the combination of distributional robustness and constraints and prevents a popular class of iterative methods from tractably solving DRC-RL, despite such frameworks being applicable to each of distributionally robust RL and constrained RL individually. Finally, we conduct a focused experiment on a car racing benchmark to evaluate the effectiveness of the proposed algorithm.
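Schematically, and in illustrative notation that may differ from the paper's exact formulation, a DRC-RL problem and its Lagrangian look like

$$
\max_{\pi}\ \min_{P\in\mathcal{P}} V^{\pi}_{r,P}\quad\text{s.t.}\quad \min_{P\in\mathcal{P}} V^{\pi}_{c,P} \ge b,
\qquad
L(\pi,\lambda) = \min_{P\in\mathcal{P}} V^{\pi}_{r,P} + \lambda\Big(\min_{P\in\mathcal{P}} V^{\pi}_{c,P} - b\Big),
$$

and the strong-duality result referenced above is what licenses solving the saddle-point problem $\min_{\lambda\ge 0}\max_{\pi} L(\pi,\lambda)$ in place of the constrained primal.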
Kishan Panaganti, Adam Wierman, Eric Mazumdar, "Model-Free Robust ϕ-Divergence Reinforcement Learning Using Both Offline and Online Data". Accepted to the Forty-first International Conference on Machine Learning, July 2024. [Paper] [Supplementary]
Abstract: The robust ϕ-regularized Markov Decision Process (RRMDP) framework focuses on designing control policies that are robust against parameter uncertainties due to mismatches between the simulator (nominal) model and real-world settings. This work makes two important contributions. First, we propose a model-free algorithm called Robust ϕ-regularized fitted Q-iteration (RPQ) for learning an ϵ-optimal robust policy that uses only the historical data collected by rolling out a behavior policy (with robust exploratory requirement) on the nominal model. To the best of our knowledge, we provide the first unified analysis for a class of ϕ-divergences achieving robust optimal policies in high-dimensional systems with general function approximation. Second, we introduce the hybrid robust ϕ-regularized reinforcement learning framework to learn an optimal robust policy using both historical data and online sampling. Towards this framework, we propose a model-free algorithm called Hybrid robust Total-variation-regularized Q-iteration (HyTQ: pronounced height-Q). To the best of our knowledge, we provide the first improved out-of-data-distribution assumption in large-scale problems with general function approximation under the hybrid robust ϕ-regularized reinforcement learning framework. Finally, we provide theoretical guarantees on the performance of the learned policies of our algorithms on systems with arbitrarily large state spaces.
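For orientation, a minimal (non-robust) fitted Q-iteration sketch on a toy offline dataset is shown below; RPQ replaces the Bellman target in this loop with a robust ϕ-regularized one. The toy MDP, tabular "regressor", and hyperparameters are assumptions made purely for illustration.

```python
# Minimal fitted Q-iteration sketch on a toy offline dataset (the non-robust
# template; RPQ replaces the target below with a robust phi-regularized one).
# The toy MDP, feature map, and hyperparameters are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 2, 0.9

# Offline dataset of (s, a, r, s') transitions collected by some behavior policy.
N = 5000
S = rng.integers(n_states, size=N)
A = rng.integers(n_actions, size=N)
R = rng.random(N) * (S == n_states - 1)          # reward only in the last state
S_next = np.minimum(S + A, n_states - 1)          # action 1 moves right

Q = np.zeros((n_states, n_actions))
for _ in range(100):
    # Standard (non-robust) target: r + gamma * max_a' Q(s', a').
    target = R + gamma * Q[S_next].max(axis=1)
    # "Fit" Q by regressing the targets onto (s, a) -- tabular least squares
    # here is just an average of targets per (s, a) cell.
    Q_new = np.zeros_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            mask = (S == s) & (A == a)
            if mask.any():
                Q_new[s, a] = target[mask].mean()
    Q = Q_new

print("greedy actions per state:", Q.argmax(axis=1))
```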
Kishan Panaganti*, Zaiyan Xu*, Dileep Kalathil, Mohammad Ghavamzadeh, "Distributionally Robust Behavioral Cloning for Robust Imitation Learning". Accepted to the IEEE Conference on Decision and Control, December 2023. [Paper] [Supplementary]
Abstract: Robust reinforcement learning (RL) aims to learn a policy that can withstand uncertainties in model parameters, which often arise in practical RL applications due to modeling errors in simulators, variations in real-world system dynamics, and adversarial disturbances. This paper introduces the robust imitation learning (IL) problem in a Markov decision process (MDP) framework where an agent learns to mimic an expert demonstrator that can withstand uncertainties in model parameters without additional online environment interactions. The agent is only provided with a dataset of state-action pairs from the expert on a single (nominal) dynamics, without any information about the true rewards from the environment. Behavioral cloning (BC), a supervised learning method, is a powerful algorithm to address the vanilla IL problem. We propose an algorithm for the robust IL problem that utilizes distributionally robust optimization (DRO) with BC. We call the algorithm DR-BC and show its robust performance against parameter uncertainties both in theory and in practice. We also demonstrate the empirical performance of our approach to addressing model perturbations on several MuJoCo continuous control tasks.
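A minimal behavioral cloning sketch is shown below as the vanilla IL baseline the abstract refers to; DR-BC layers distributionally robust optimization on top of this supervised step. The synthetic expert data, softmax policy model, and step size are assumptions made for illustration.

```python
# Minimal behavioral cloning sketch: fit a softmax policy to expert (state,
# action) pairs by maximizing log-likelihood. This is the vanilla IL baseline;
# DR-BC adds a distributionally robust optimization layer on top of this
# supervised step. Data, model, and step size are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(2)
d, n_actions, N = 4, 3, 2000

# Synthetic expert data: states are feature vectors, expert picks actions
# from a hidden linear policy.
W_expert = rng.normal(size=(d, n_actions))
X = rng.normal(size=(N, d))
A = (X @ W_expert).argmax(axis=1)

W = np.zeros((d, n_actions))
lr = 0.1
for _ in range(500):
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Gradient of the negative log-likelihood of the expert actions.
    grad = X.T @ (probs - np.eye(n_actions)[A]) / N
    W -= lr * grad

acc = ((X @ W).argmax(axis=1) == A).mean()
print(f"behavioral cloning action-match accuracy: {acc:.2f}")
```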
Jessica Maghakian, Paul Mineiro, Kishan Panaganti*, Mark Rucker, Akanksha Saran, Cheng Tan, "Personalized Reward Learning with Interaction-Grounded Learning (IGL)". Accepted (31.8% acceptance rate) to International Conference on Learning Representations, April 2023. [Paper] [Supplementary]
Abstract: In an era of countless content offerings, recommender systems alleviate information overload by providing users with personalized content suggestions. Due to the scarcity of explicit user feedback, modern recommender systems typically optimize for a fixed combination of implicit feedback signals across all users. However, this approach disregards a growing body of work showing that (i) implicit signals can be used by users in diverse ways, signaling anything from satisfaction to active dislike, and (ii) different users communicate preferences in different ways. We propose applying the recent Interaction Grounded Learning (IGL) paradigm to address the challenge of learning representations of diverse user communication modalities. Rather than taking a fixed, human-designed reward function, IGL is able to learn personalized reward functions for different users and then optimize directly for the latent user satisfaction. We demonstrate the success of IGL with experiments using simulations as well as with real-world production traces.
*Alphabetical order
Zaiyan Xu*, Kishan Panaganti*, Dileep Kalathil, "Improved Sample Complexity Bounds for Distributionally Robust Reinforcement Learning". Accepted (29% acceptance rate) to Artificial Intelligence and Statistics, April 2023. [Paper] [Supplementary] [Code]
Abstract: We consider the problem of learning a control policy that is robust against the parameter mismatches between the training environment and testing environment. We formulate this as a distributionally robust reinforcement learning (DR-RL) problem where the objective is to learn the policy which maximizes the value function against the worst possible stochastic model of the environment in an uncertainty set. We focus on the tabular episodic learning setting where the algorithm has access to a generative model of the nominal (training) environment around which the uncertainty set is defined. We propose the Robust Phased Value Learning (RPVL) algorithm to solve this problem for the uncertainty sets specified by four different divergences: total variation, chi-square, Kullback-Leibler, and Wasserstein. We show that our algorithm achieves $O(|S||A|H^{5})$ sample complexity, which is uniformly better than the existing results by a factor of $|S|$, where $|S|$ is the number of states, $|A|$ is the number of actions, and $H$ is the horizon length. We also provide the first-ever sample complexity result for the Wasserstein uncertainty set. Finally, we demonstrate the performance of our algorithm using simulation experiments.
*Equal contributions
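As an example of the kind of reformulation that makes the inner minimization in the RPVL entry above tractable from samples, the worst-case expectation over a KL uncertainty set admits a well-known scalar dual (stated here as a standard identity; the paper treats four divergences, each with its own dual form):

$$
\inf_{Q:\,D_{\mathrm{KL}}(Q\,\Vert\,P)\le \rho}\ \mathbb{E}_{s'\sim Q}\big[V(s')\big]
=
\sup_{\beta\ge 0}\Big\{-\beta\log\mathbb{E}_{s'\sim P}\big[e^{-V(s')/\beta}\big] - \beta\rho\Big\},
$$

so the adversarial choice of model reduces to a one-dimensional search over $\beta$ using samples from the nominal model $P$.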
Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh, "Robust Reinforcement Learning using Offline Data". Accepted to Neural Information Processing Systems, December 2022. [Paper] [Supplementary] [Code]
Abstract: The goal of robust reinforcement learning (RL) is to learn a policy that is robust against the uncertainty in model parameters. Parameter uncertainty commonly occurs in many real-world RL applications due to simulator modeling errors, changes in the real-world system dynamics over time, and adversarial disturbances. Robust RL is typically formulated as a max-min problem, where the objective is to learn the policy that maximizes the value against the worst possible models that lie in an uncertainty set. In this work, we propose a robust RL algorithm called Robust Fitted Q-Iteration (RFQI), which uses only an offline dataset to learn the optimal robust policy. Robust RL with offline data is significantly more challenging than its non-robust counterpart because of the minimization over all models present in the robust Bellman operator. This poses challenges in offline data collection, optimization over the models, and unbiased estimation. In this work, we propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm. We prove that RFQI learns a near-optimal robust policy under standard assumptions and demonstrate its superior performance on standard benchmark problems.
Kishan Panaganti, Dileep Kalathil, "Sample Complexity of Robust Reinforcement Learning with a Generative Model". Accepted to Artificial Intelligence and Statistics, March 2022. [Paper] [Supplementary] [Code]
Abstract: The Robust Markov Decision Process (RMDP) framework focuses on designing control policies that are robust against the parameter uncertainties due to the mismatches between the simulator model and real-world settings. An RMDP problem is typically formulated as a max-min problem, where the objective is to find the policy that maximizes the value function for the worst possible model that lies in an uncertainty set around a nominal model. The standard robust dynamic programming approach requires the knowledge of the nominal model for computing the optimal robust policy. In this work, we propose a model-based reinforcement learning (RL) algorithm for learning an ϵ-optimal robust policy when the nominal model is unknown. We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence. For each of these uncertainty sets, we give a precise characterization of the sample complexity of our proposed algorithm. In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies. Finally, we demonstrate the performance of our algorithm on two benchmark problems.
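Schematically, in standard RMDP notation (used here for illustration), the max-min formulation above corresponds to the robust Bellman equation

$$
V^{*}(s) = \max_{a}\Big\{ r(s,a) + \gamma \inf_{P_{s,a}\in\mathcal{P}_{s,a}} \mathbb{E}_{s'\sim P_{s,a}}\big[V^{*}(s')\big]\Big\},
\qquad
\mathcal{P}_{s,a} = \big\{P : D\big(P,\,P^{o}_{s,a}\big)\le \rho\big\},
$$

where $P^{o}$ is the nominal model, $\rho$ is the radius of the uncertainty set, and $D$ is the total variation distance, chi-square divergence, or KL divergence considered in the paper.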
Kishan Panaganti, Dileep Kalathil, "Sample Complexity of Model-Based Robust Reinforcement Learning". Accepted to the IEEE Conference on Decision and Control, December 2021. [Paper] [Supplementary] [Code]
Abstract: The Robust Markov Decision Process (RMDP) framework focuses on designing control policies that are robust against the parameter uncertainties due to the mismatches between the simulator model and real-world settings. An RMDP problem is typically formulated as a max-min problem, where the objective is to find the policy that maximizes the value function for the worst possible model that lies in an uncertainty set around a nominal model. The standard robust dynamic programming approach requires the knowledge of the nominal model for computing the optimal robust policy. In this work, we propose a model-based reinforcement learning (RL) algorithm for learning an ϵ-optimal robust policy when the nominal model is unknown. We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence. For each of these uncertainty sets, we give a precise characterization of the sample complexity of our proposed algorithm. In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies. Finally, we demonstrate the performance of our algorithm on two benchmark problems.
Kishan Panaganti, Dileep Kalathil, "Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees". Accepted to the Thirty-eighth International Conference on Machine Learning, July 2021. [Paper] [Supplementary]
Abstract: We address the problem of model-free reinforcement learning for Robust Markov Decision Processes (RMDPs) with large state spaces. The goal of the RMDP framework is to find a policy that is robust against the parameter uncertainties due to the mismatch between the simulator model and real-world settings. We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation. We prove the convergence of this algorithm using stochastic approximation techniques. We then propose the Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy. We also give a general weighted Euclidean norm bound on the error (closeness to optimality) of the resulting policy. Finally, we demonstrate the performance of our RLSPI algorithm on some standard benchmark problems.
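For orientation, a minimal non-robust LSTD policy-evaluation sketch with linear features is given below; the paper's Robust Least Squares Policy Evaluation modifies the target with a worst-case model term. The toy chain, feature map, and policy here are assumptions made for illustration.

```python
# Minimal non-robust LSTD policy evaluation with linear features: solve for
# w such that Phi w ≈ r + gamma * Phi' w from sampled transitions. The paper's
# Robust Least Squares Policy Evaluation adds a worst-case-model term to the
# target; the toy chain, features, and policy here are assumptions.
import numpy as np

rng = np.random.default_rng(3)
n_states, gamma, N = 6, 0.95, 5000

def features(s):
    """Simple polynomial features of the normalized state."""
    x = s / (n_states - 1)
    return np.array([1.0, x, x * x])

# Sample transitions of a fixed policy on a toy random-walk chain with a
# reward for reaching the right end.
S = rng.integers(n_states, size=N)
S_next = np.clip(S + rng.choice([-1, 1], size=N), 0, n_states - 1)
R = (S_next == n_states - 1).astype(float)

Phi = np.stack([features(s) for s in S])           # (N, d)
Phi_next = np.stack([features(s) for s in S_next])

# LSTD solution: (Phi^T (Phi - gamma Phi')) w = Phi^T r.
A = Phi.T @ (Phi - gamma * Phi_next)
b = Phi.T @ R
w = np.linalg.solve(A, b)

print("estimated values per state:",
      [round(float(features(s) @ w), 2) for s in range(n_states)])
```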
Kishan Panaganti, Dileep Kalathil, "Bounded Regret for Finitely Parameterized Multi-Armed Bandits". Accepted to the IEEE Conference on Decision and Control, December 2020. [Paper] [Supplementary]
Abstract: We developed an algorithm that can exploit the structure of the underlying system to achieve optimal and data-efficient performance. We formulated this problem as a finitely parameterized multi-armed bandits problem where the model of the underlying stochastic environment can be characterized by a common unknown parameter. The true parameter is unknown to the learning agent; however, the set of possible parameters, which is finite, is known a priori. We proposed a simple, easy-to-implement algorithm, which we call the FP-UCB algorithm, that uses the information about the underlying parameter set for faster learning. We characterized the performance of this algorithm using the metric of regret, a notion that quantifies how far the learning algorithm is from the optimal decision at each time-step of the learning process. In particular, we show that the FP-UCB algorithm achieves a bounded regret under some structural condition on the underlying parameter set. We also show that, if the underlying parameter set does not satisfy this structural condition, the FP-UCB algorithm achieves a logarithmic regret, but with a smaller leading constant compared to the standard RL algorithms. We also validated the superior performance of the FP-UCB algorithm through extensive experiments.
Eric Mazumdar, Kishan Panaganti, Laixi Shi, "A Behavioral Economics Approach to Principled Multi-Agent Reinforcement Learning". Appeared in NeurIPS Workshop on Behavioral Machine Learning, December 2024. [Paper] [Supplementary]
Abstract: A significant roadblock to the development of principled multi-agent reinforcement learning is the fact that desired solution concepts like Nash equilibria may be intractable to compute. To overcome this obstacle, we take inspiration from behavioral economics and show that—by imbuing agents with important features of human decision-making like risk aversion and bounded rationality—a class of risk-averse quantal response equilibria (RQE) become tractable to compute in all n-player matrix and finite-horizon Markov games. In particular, we show that they emerge as the endpoint of no-regret learning in suitably adjusted versions of the games. Crucially, the class of computationally tractable RQE is independent of the underlying game structure and only depends on agents’ degree of risk-aversion and bounded rationality. To validate the richness of this class of solution concepts we show that it captures peoples’ patterns of play in a number of 2-player matrix games previously studied in experimental economics. Furthermore, we give a first analysis of the sample complexity of computing these equilibria in finite-horizon Markov games when one has access to a generative model and validate our findings on a simple multi-agent reinforcement learning benchmark.
Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh, "Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage". On Arxiv in October 2023. Appeared in ICML 2024 Workshop: Foundations of Reinforcement Learning and Control -- Connections and Perspectives [Paper] [Supplementary] [Code]
Abstract: The goal of an offline reinforcement learning (RL) algorithm is to learn optimal policies using historical (offline) data, without access to the environment for online exploration. One of the main challenges in offline RL is distribution shift, which refers to the difference between the state-action visitation distribution of the data-generating policy and that of the learning policy. Many recent works have used the idea of pessimism for developing offline RL algorithms and characterizing their sample complexity under a relatively weak assumption of single-policy concentrability. Different from the offline RL literature, the area of distributionally robust learning (DRL) offers a principled framework that uses a minimax formulation to tackle model mismatch between training and testing environments. In this work, we aim to bridge these two areas by showing that the DRL approach can be used to tackle the distributional shift problem in offline RL. In particular, we propose two offline RL algorithms using the DRL framework, for the tabular and linear function approximation settings, and characterize their sample complexity under the single-policy concentrability assumption. We also demonstrate the superior performance of our proposed algorithms through simulation experiments.
Jessica Maghakian, Kishan Panaganti, Paul Mineiro, Akanksha Saran, Cheng Tan, "Interaction-Grounded Learning for Recommendation Systems". Appeared in Online Recommender Systems and User Modeling ACM RecSys 2022 Workshop. [Paper]
Abstract: Recommender systems have long grappled with optimizing user satisfaction using only implicit user feedback. Many approaches in the literature rely on complicated feedback modeling and costly user studies. We propose online recommender systems as a candidate for the recently introduced Interaction Grounded Learning (IGL) paradigm. In IGL, a learner attempts to optimize a latent reward in an environment by observing feedback with no grounding. We introduce a novel personalized variant of IGL for recommender systems that can leverage explicit and implicit user feedback to maximize user satisfaction, with no feedback signal modeling and minimal assumptions. With our empirical evaluations that include simulations as well as experiments on real product data, we demonstrate the effectiveness of IGL for recommender systems.
Kishan Panaganti, Dileep Kalathil, "Model-Free Robust Reinforcement Learning with Linear Function Approximation". Appeared in Challenges of Real-World RL NeurIPS 2020 Workshop. [Preprint]
Abstract: We developed an RL algorithm called the Robust Least Squares Policy Iteration (RLSPI) algorithm that learns the optimal policy which is robust against parameter mismatches. We addressed this problem using the Robust Markov Decision Process (RMDP) formulation, which considers a set of model parameters (uncertainty set) under the assumption that the actual real-world parameters lie in this uncertainty set. The proposed learning algorithm finds a policy that performs best under the worst model parameter in the uncertainty set, and hence this policy is robust to the parameter mismatch (sim-2-real) between the training and testing environments. We focused on the problem with very large state spaces using the approach of least-squares-based online model-free reinforcement learning with linear function approximation. We also demonstrated the performance of the RLSPI algorithm on various OpenAI Gym test environments: CartPole, MountainCar, Acrobot, and FrozenLake.