Policy Optimization Method Towards Optimal-time Stability

Abstract

In current model-free reinforcement learning (RL) algorithms, stability criteria based on sampling methods are commonly used to guide policy optimization. However, these criteria only guarantee that the system's state converges to an equilibrium point in infinite time, which leads to sub-optimal policies. In this paper, we propose a policy optimization technique that incorporates sampling-based Lyapunov stability. Our approach enables the system's state to reach an equilibrium point within an optimal time and to remain stable thereafter, a property we refer to as "optimal-time stability". To achieve this, we integrate the optimization method into the Actor-Critic framework, resulting in the Adaptive Lyapunov-based Actor-Critic (ALAC) algorithm. In evaluations on ten robotic tasks, our approach significantly outperforms prior methods, effectively guiding the system to generate stable patterns.

Overview

We further make the following contributions:

We present GIFs below to show the architecture of ALAC.
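To complement the diagrams, the sketch below illustrates the kind of sampling-based Lyapunov decrease condition that an approach like ALAC builds on. It is a minimal, hypothetical PyTorch-style example rather than the released implementation: the policy.sample interface, the lyapunov_critic network, the batch layout, and the coefficient alpha are all placeholder assumptions.

import torch

# Minimal sketch of an actor objective that penalizes violations of a
# sampling-based Lyapunov decrease condition via a Lagrange multiplier.
def lyapunov_actor_loss(policy, lyapunov_critic, batch, log_lambda, alpha=0.1):
    s, a, s_next = batch["state"], batch["action"], batch["next_state"]
    a_next, _ = policy.sample(s_next)         # hypothetical sampling interface
    l_now = lyapunov_critic(s, a)             # L(s, a): candidate Lyapunov value
    l_next = lyapunov_critic(s_next, a_next)  # L(s', a') under the current policy
    # Decrease condition: L(s', a') - L(s, a) + alpha * L(s, a) <= 0 on average
    violation = l_next - l_now + alpha * l_now
    # The multiplier (kept positive via exp) scales the penalty on violations
    return (log_lambda.exp() * violation).mean()

In a full training loop the multiplier itself would presumably be updated as well, tightening or relaxing the constraint adaptively.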

State Stability Visualisation

We use t-SNE to visualize, in 3D, the stability of the system learned by ALAC. In dynamical systems theory, a system's phase-space portrait reflects its stability, so we also show various phase-space trajectories to analyze the form of stability. In both views, the state and phase trajectories converge to a single point or a circle.
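As a rough illustration of how such a projection can be produced, the following sketch embeds recorded rollout states into 3D with scikit-learn's t-SNE; the file name, array layout, and plotting details are assumptions, not the authors' exact pipeline.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical rollout of shape (T, state_dim), saved beforehand
states = np.load("rollout_states.npy")

# Embed the high-dimensional state trajectory in 3D
embedded = TSNE(n_components=3, perplexity=30, init="pca").fit_transform(states)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot(embedded[:, 0], embedded[:, 1], embedded[:, 2], linewidth=0.8)
ax.scatter(*embedded[-1], color="red", label="final state")  # expected convergence point
ax.legend()
plt.show()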

Performance Evaluation and Comparison

The following environments were used in our experiments, and the figures compare the accumulated cost and constraint violations of ALAC against other algorithms.

Overview of Environments

We also show how the performance of ALAC varies across different actor network structures, compared with SAC-Cost using the same structures. SAC-Cost converges more slowly, or fails to converge at all, relative to ALAC.

Ablation Studies

We report how the performance of ALAC varies under different stability certifications across tasks.

Generalisation

Experiments show that ALAC achieves excellent generalization when error feedback is provided, whereas the same feedback degrades the performance of SAC-Cost.

Robustness

We verify that ALAC achieves excellent robustness on most tasks. Notably, we introduce periodic external disturbances of different magnitudes in each task.
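For a concrete picture of what a periodic disturbance might look like, here is a minimal sketch of a Gym-style wrapper that adds a sinusoidal perturbation to every action. The additive action-space form, the magnitude, and the period are illustrative assumptions rather than the exact disturbance model used in our experiments.

import numpy as np
import gym


class PeriodicDisturbanceWrapper(gym.Wrapper):
    """Adds a sinusoidal disturbance of a chosen magnitude to every action."""

    def __init__(self, env, magnitude=0.1, period=50):
        super().__init__(env)
        self.magnitude = magnitude
        self.period = period
        self.t = 0

    def step(self, action):
        disturbance = self.magnitude * np.sin(2.0 * np.pi * self.t / self.period)
        self.t += 1
        return self.env.step(action + disturbance)

    def reset(self, **kwargs):
        self.t = 0
        return self.env.reset(**kwargs)

Wrapping an environment this way, e.g. env = PeriodicDisturbanceWrapper(gym.make("Pendulum-v1"), magnitude=0.2), leaves the rest of the training code unchanged.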