Donghwan Lee, "Final iteration convergence bound of Q-Learning: switching system approach," IEEE Transactions on Automatic Control, vol. 69, no. 7, 2024
Han-Dong Lim, Donghwan Lee, "Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view" IEEE Access, 2024
Narim Jeong, Donghwan Lee, "Finite-time error analysis of soft Q-learning: switching system approach," IEEE CDC2024 [link]
Donghwan Lee and Niao He, "A unified switching system perspective and convergence analysis of Q-learning algorithms," NeurIPS2020 [link] [online extension].
Donghwan Lee, Jianghai Hu, and Niao He, "A discrete-time switching system analysis of Q-learning," submitted [link]
This study presents a novel approach to proving the convergence of Q-learning, one of the most prominent reinforcement learning algorithms. By modeling Q-learning as a switching system from control theory, we analyze and prove its convergence with control-theoretic techniques, offering a perspective entirely different from existing approaches. The results of this paper were presented at NeurIPS 2020.
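For intuition, the switching-system form can be sketched as follows (synchronous, noise-free case; the notation here is assumed rather than copied from the paper). Stacking the Q-values into a vector Q, with reward vector R, transition matrix P, and greedy-policy selector \Pi_Q, the Q-learning iteration reads

    Q_{k+1} = Q_k + \alpha \left( R + \gamma P \Pi_{Q_k} Q_k - Q_k \right)
            = \big( (1-\alpha) I + \alpha \gamma P \Pi_{Q_k} \big) Q_k + \alpha R,

an affine system whose system matrix switches with the greedy policy induced by the current iterate Q_k; convergence can then be studied with stability tools for switched affine systems.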
Han-Dong Lim, Donghwan Lee, "Regularized Q-learning," NeurIPS2024
Q-learning is a widely used algorithm in the reinforcement learning (RL) community. In the lookup-table setting, its convergence is well established; however, its behavior is known to be unstable when linear function approximation is used. This paper develops a new Q-learning algorithm, called RegQ, that converges under linear function approximation. We prove that simply adding an appropriate regularization term ensures convergence of the algorithm. Its stability is established using a recent analysis tool based on switching system models. Moreover, we experimentally show that RegQ converges in environments where Q-learning with linear function approximation is known to diverge. An error bound on the solution to which the algorithm converges is also given.
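As a rough illustration only (not the paper's exact algorithm), the sketch below shows the kind of update a regularized linear Q-learning scheme uses: a standard semi-gradient Q-learning step plus an l2-style regularization term. The feature map phi, step size alpha, and regularization weight eta are all assumed names here.

    def regularized_q_step(theta, phi, s, a, r, s_next, actions,
                           alpha=0.1, gamma=0.99, eta=0.01):
        """Sketch of a regularized linear Q-learning update (illustrative only).

        theta : NumPy parameter vector; Q(s, a) is approximated by phi(s, a) @ theta
        phi   : feature map, phi(s, a) -> NumPy array
        eta   : regularization weight; eta = 0 recovers standard linear Q-learning
        """
        q_next = max(phi(s_next, b) @ theta for b in actions)   # greedy bootstrap
        td_error = r + gamma * q_next - phi(s, a) @ theta
        # The extra -eta * theta term is the regularization that, per the paper,
        # restores convergence under linear function approximation.
        return theta + alpha * (td_error * phi(s, a) - eta * theta)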
Donghwan Lee, "Final iteration convergence bound of Q-Learning: switching system approach," IEEE Transactions on Automatic Control, vol. 69, no. 7, 2024
Han-Dong Lim, Donghwan Lee, "Backstepping temporal-difference learning," ICLR2023, Kigali, Rwanda, May 1-5, 2023
Han-Dong Lim, Donghwan Lee, "Regularized Q-learning," NeurIPS2024
Han-Dong Lim, Donghwan Lee, "Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view" IEEE Access, 2024
Donghwan Lee and Niao He, ``A unified switching system perspective and convergence analysis of Q-learning algorithms,'' NeurIPS2020 [link] [Online extension].
Donghwan Lee, Jianghai Hu, and Niao He, “A discrete-time switching system analysis of Q-learning,” submitted [link]
Various reinforcement learning algorithms, and in particular the convergence of TD-learning, are analyzed by modeling them as linear or nonlinear dynamical systems. Through this approach, new reinforcement learning algorithms are developed, along with new perspectives and interpretations.
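As a concrete and standard instance of this viewpoint (notation assumed here, not copied from the papers), on-policy TD(0) with linear function approximation has the expected update

    \theta_{k+1} = \theta_k + \alpha \left( A \theta_k + b \right), \qquad
    A = \Phi^\top D (\gamma P_\pi - I) \Phi, \quad b = \Phi^\top D R_\pi,

where \Phi stacks the feature vectors, D is the diagonal matrix of the stationary state distribution, and P_\pi, R_\pi are the transition matrix and reward vector under the evaluated policy. Convergence follows from A being Hurwitz, and the works listed above build on this type of dynamical-system model and its nonlinear or switched generalizations.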
Han-Dong Lim, Donghwan Lee, "Backstepping temporal-difference learning," ICLR2023, Kigali, Rwanda, May 1-5, 2023
The ability to learn off-policy is an important feature of reinforcement learning (RL) for practical applications. However, even one of the most elementary RL algorithms, temporal-difference (TD) learning, is known to suffer from divergence when the off-policy scheme is used together with linear function approximation. To overcome this divergent behavior, several off-policy TD-learning algorithms have been developed to date. In this work, we provide a unified view of such algorithms from a purely control-theoretic perspective. Our method relies on the backstepping technique, which is widely used in nonlinear control theory.
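For context, the baseline whose divergence motivates this line of work is standard off-policy TD(0) with linear function approximation and importance-sampling ratios; a minimal sketch of that baseline (not the backstepping algorithm itself) is:

    def off_policy_td0_step(theta, phi, s, r, s_next, rho, alpha=0.05, gamma=0.99):
        """Standard off-policy TD(0) with linear function approximation.

        rho = pi(a|s) / beta(a|s) is the importance-sampling ratio between the
        target policy pi and the behavior policy beta.  This plain update can
        diverge (e.g., on Baird's counterexample), which is the issue the
        paper's backstepping-based algorithms are designed to avoid.
        """
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        return theta + alpha * rho * td_error * phi(s)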
Hyeann Lee, Donghwan Lee, "Suppressing overestimation in Q-learning through adversarial behaviors," Annual Allerton Conference on Communication, Control, and Computing, 2024 [link]
The goal of this paper is to propose a new Q-learning algorithm with a dummy adversarial player, called dummy adversarial Q-learning (DAQ), that can effectively regulate the overestimation bias in standard Q-learning. With the dummy player, the learning can be formulated as a two-player zero-sum game. The proposed DAQ unifies several Q-learning variations for controlling overestimation biases, such as maxmin Q-learning and minmax Q-learning (proposed in this paper), in a single framework. DAQ is a simple but effective way to suppress the overestimation bias through dummy adversarial behaviors and can be easily applied to off-the-shelf value-based reinforcement learning algorithms to improve their performance. A finite-time convergence of DAQ is analyzed from an integrated perspective by adapting an adversarial Q-learning analysis. The performance of the suggested DAQ is empirically demonstrated under various benchmark environments.
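To see the overestimation bias being targeted here, recall that taking a max over noisy value estimates is biased upward even when each estimate is unbiased. The short check below is purely illustrative and unrelated to DAQ's specific construction:

    import numpy as np

    rng = np.random.default_rng(0)
    true_q = np.zeros(10)                               # every action has true value 0
    noise = rng.normal(scale=1.0, size=(100_000, 10))   # zero-mean estimation noise

    # E[max_a (Q(a) + noise)] > max_a Q(a): the max over noisy estimates is biased up.
    print(np.mean(np.max(true_q + noise, axis=1)))      # roughly 1.5, not 0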
Han-Dong Lim, Donghwan Lee, "Finite-time analysis of temporal difference learning with experience replay," Transactions on Machine Learning Research (TMLR), 2024 [link]
Temporal-difference (TD) learning is widely regarded as one of the most popular algorithms in reinforcement learning (RL). Despite its widespread use, it is only recently that researchers have begun to actively study its finite-time behavior, including finite-time bounds on the mean squared error and sample complexity. On the empirical side, experience replay has been a key ingredient in the success of deep RL algorithms, but its theoretical effects on RL have yet to be fully understood. In this paper, we present a simple decomposition of the Markovian noise terms and provide finite-time error bounds for tabular on-policy TD-learning with experience replay. Specifically, under the Markovian observation model, we demonstrate that for both the averaged iterate and the final iterate, the error term induced by a constant step-size can be effectively controlled by the size of the replay buffer and of the mini-batch sampled from it.
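A minimal sketch of the setting analyzed (tabular on-policy TD(0) with a replay buffer and mini-batch averaging; the buffer size, mini-batch size, and constant step size are the quantities the bounds depend on, and every implementation detail below is assumed):

    import random
    from collections import deque

    def td0_with_replay(env_step, num_states, buffer_size=1000, batch_size=32,
                        alpha=0.05, gamma=0.99, num_iters=10_000):
        """Tabular on-policy TD(0) with experience replay (illustrative sketch).

        env_step() is assumed to yield transitions (s, r, s_next) generated by
        following the evaluated policy (Markovian observation model).
        """
        V = [0.0] * num_states
        buffer = deque(maxlen=buffer_size)             # replay buffer
        for _ in range(num_iters):
            buffer.append(env_step())                  # store the newest transition
            batch = random.sample(list(buffer), min(batch_size, len(buffer)))
            # Mini-batch TD(0): average the TD errors before applying the update.
            update = [0.0] * num_states
            for s, r, s_next in batch:
                update[s] += (r + gamma * V[s_next] - V[s]) / len(batch)
            V = [v + alpha * u for v, u in zip(V, update)]
        return V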
Various reinforcement learning problems can be reformulated as optimization or saddle-point problems, which can then be solved with a variety of techniques developed for such problems.
Donghwan Lee, Han-Dong Lim, Jihoon Park, and Okyong Choi, "New versions of gradient temporal-difference learning," IEEE Transactions on Automatic Control, 2023 [link]
Sutton, Szepesvári, and Maei introduced the first gradient temporal-difference (GTD) learning algorithms compatible with both linear function approximation and off-policy training. The goal of this article is 1) to propose some variants of GTDs with an extensive comparative analysis and 2) to establish new theoretical analysis frameworks for the GTDs. These variants are based on convex–concave saddle-point interpretations of GTDs, which effectively unify all the GTDs into a single framework, and provide a simple stability analysis based on recent results on primal–dual gradient dynamics. Finally, a numerical comparative analysis is given to evaluate the new approaches.
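The saddle-point interpretation referred to here can be sketched as follows (standard notation assumed, which may differ from the article's). Minimizing the mean-squared projected Bellman error is equivalent to the convex–concave problem

    \min_{\theta} \max_{\lambda} \; L(\theta, \lambda)
    = \lambda^\top \Phi^\top D \big( R + \gamma P \Phi \theta - \Phi \theta \big)
      - \tfrac{1}{2} \lambda^\top \Phi^\top D \Phi \, \lambda,

and stochastic primal–dual gradient steps on L (descent in \theta, ascent in \lambda) recover GTD-type algorithms; stability can then be examined through the lens of primal–dual gradient dynamics, which is the route taken in the article.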
Donghwan Lee and Niao He, "Target-based temporal difference learning," ICML2019, Long Beach, CA, June 11-15, 2019 [link] [online extension].
The use of target networks has been a popular and key component of recent deep Q-learning algorithms for reinforcement learning, yet little is known about it from the theory side. In this work, we introduce a new family of target-based temporal-difference (TD) learning algorithms that maintain two separate learning parameters: the target variable and the online variable. We propose three members of the family, the averaging TD, double TD, and periodic TD, where the target variable is updated in an averaging, symmetric, or periodic fashion, respectively, mirroring the techniques used in deep Q-learning practice. We establish asymptotic convergence analyses for both averaging TD and double TD and a finite-sample analysis for periodic TD. In addition, we provide simulation results showing potentially superior convergence of these target-based TD algorithms compared to standard TD-learning. While this work focuses on the linear function approximation and policy evaluation setting, we consider it a meaningful step towards a theoretical understanding of deep Q-learning variants with target networks.
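A rough sketch of the averaging-style member of this family (hypothetical notation; the paper's precise updates may differ): the online parameter bootstraps from a separate target parameter, and the target parameter slowly tracks the online one.

    def averaging_td_step(theta, theta_target, phi, s, r, s_next,
                          alpha=0.05, tau=0.01, gamma=0.99):
        """One step of an averaging-style target-based TD update (illustrative).

        theta        : online parameter, updated with the TD step
        theta_target : target parameter used inside the bootstrap term
        tau          : averaging rate; tau = 1 essentially recovers standard TD(0)
        """
        # The TD error bootstraps from the *target* parameter, as with target networks.
        td_error = r + gamma * phi(s_next) @ theta_target - phi(s) @ theta
        theta = theta + alpha * td_error * phi(s)
        # The target slowly tracks the online parameter (Polyak-style averaging).
        theta_target = (1 - tau) * theta_target + tau * theta
        return theta, theta_target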
Donghwan Lee, Do Wan Kim, Jianghai Hu, "Distributed off-policy temporal difference learning using primal-dual method," IEEE Access, vol. 10, 2022
The goal of this paper is to provide theoretical analysis and additional insights on a distributed temporal-difference (TD) learning algorithm for multi-agent Markov decision processes (MDPs) from a saddle-point viewpoint. The (single-agent) TD-learning is a reinforcement learning (RL) algorithm for evaluating a given policy based on reward feedback. In multi-agent settings, multiple RL agents act concurrently, and each agent receives its own local rewards. The goal of each agent is to evaluate the given policy with respect to the global reward, defined as the average of the local rewards, by sharing learning parameters through random network communications. In this paper, we propose a distributed TD-learning algorithm based on saddle-point frameworks and provide a rigorous finite-time convergence analysis of the algorithm and its solution using tools from optimization theory. The results in this paper provide a general and unified perspective on the distributed policy evaluation problem and theoretically complement previous works.
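For the multi-agent setting described above, a generic consensus-based sketch (deliberately not the paper's primal–dual algorithm; all names are assumed) looks as follows: each agent mixes its parameters with those of its network neighbors and then takes a local TD step with its own reward.

    def distributed_td_round(thetas, W, phi, s, rewards, s_next, alpha=0.05, gamma=0.99):
        """One round of consensus-based distributed TD(0) (illustrative sketch only).

        thetas  : list of local parameter vectors, one per agent
        W       : doubly stochastic mixing matrix encoding the communication network
        rewards : local rewards r_i observed by each agent for the transition (s, s_next)
        """
        N = len(thetas)
        # 1) Consensus step: each agent averages its neighbors' parameters.
        mixed = [sum(W[i][j] * thetas[j] for j in range(N)) for i in range(N)]
        # 2) Local TD(0) step using the agent's own reward.
        return [mixed[i] + alpha * (rewards[i] + gamma * phi(s_next) @ mixed[i]
                                    - phi(s) @ mixed[i]) * phi(s)
                for i in range(N)]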