Abstract
Mirror descent (MD), a well-known first-order method in constrained convex optimization, has recently been shown to be an important tool for analyzing trust-region algorithms in reinforcement learning (RL). Inspired by such theoretical analyses, we propose an efficient RL algorithm, called mirror descent policy optimization (MDPO). MDPO iteratively updates the policy by approximately solving a trust-region problem, whose objective function consists of two terms: a linearization of the standard RL objective and a proximity term that restricts two consecutive policies to be close to each other. Each update performs this approximation by taking multiple gradient steps on this objective function. We derive on-policy and off-policy variants of MDPO, while emphasizing important design choices motivated by the existing theory of MD in RL.
Contributions
- We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms, TRPO and PPO, and show that explicitly enforcing the trust-region constraint is not necessary for the strong performance gains seen in TRPO.
- We then show how the popular soft actor-critic (SAC) algorithm can be derived by slight modifications of off-policy MDPO.
- As a side result, we address another common belief: that PPO is a better performing algorithm than TRPO. By reporting results for all algorithms, both in a vanilla version and in a version loaded with code-level optimization techniques, we show that in both cases TRPO consistently performs better than PPO.
- Through our on-policy and off-policy experiments, we show how MDPO manifests as a fundamental algorithm with high practical utility which achieves state-of-the-art performance across a number of benchmark tasks.
Algorithms
At each iteration k, MDPO updates the policy by approximately solving a trust-region problem, performing multiple steps of SGD on its objective.
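In its general form, the trust-region problem can be sketched as follows (a schematic following the standard mirror-descent formulation; here \(t_k\) denotes the MD step size, \(\rho_{\pi_k}\) the state distribution, and \(A^{\pi_k}\) the advantage function of the current policy):

```latex
\pi_{k+1} \in \arg\max_{\pi \in \Pi} \;
\mathbb{E}_{s \sim \rho_{\pi_k}} \Big[
\mathbb{E}_{a \sim \pi} \big[ A^{\pi_k}(s, a) \big]
- \frac{1}{t_k} \, \mathrm{KL}\big( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \big)
\Big]
```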
On-policy MDPO
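One way to write the on-policy objective, where both terms are estimated from samples of the current policy and the linearized RL objective appears as an importance-weighted advantage (a sketch under the notation above, not pulled verbatim from the paper):

```latex
L^{\text{on}}(\theta; \theta_k) =
\mathbb{E}_{s \sim \rho_{\theta_k},\, a \sim \pi_{\theta_k}} \left[
\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)} \, A^{\theta_k}(s, a)
\right]
- \frac{1}{t_k} \,
\mathbb{E}_{s \sim \rho_{\theta_k}} \Big[
\mathrm{KL}\big( \pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta_k}(\cdot \mid s) \big)
\Big]
```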
Off-policy MDPO
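The off-policy variant replaces the on-policy advantage with a learned critic \(Q^{\theta_k}\) and samples states from a replay buffer \(\mathcal{D}\) (again a sketch under the same notation):

```latex
L^{\text{off}}(\theta; \theta_k) =
\mathbb{E}_{s \sim \mathcal{D}} \Big[
\mathbb{E}_{a \sim \pi_\theta} \big[ Q^{\theta_k}(s, a) \big]
- \frac{1}{t_k} \, \mathrm{KL}\big( \pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta_k}(\cdot \mid s) \big)
\Big]
```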
On Multiple SGD Steps
Multiple steps of SGD are crucial for approximately solving each MD iterate, and result in improved performance.
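To illustrate why multiple gradient steps matter, consider a toy categorical policy, where each MD iterate has the closed form \(\pi_{k+1} \propto \pi_k \exp(t_k A)\). Repeated gradient ascent steps on the logits recover this solution (an illustrative sketch, not the paper's implementation; `md_step` and its parameters are our own names):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def md_step(pi_k, adv, t_k, n_sgd=3000, lr=0.2):
    """Approximately solve one MD iterate for a categorical policy by
    gradient ascent on logits:
        maximize  sum_a pi(a) * adv(a) - (1/t_k) * KL(pi || pi_k)
    """
    logits = np.log(pi_k)  # warm-start at the current policy
    for _ in range(n_sgd):
        pi = softmax(logits)
        # gradient of the objective w.r.t. pi
        g = adv - (1.0 / t_k) * (np.log(pi) - np.log(pi_k) + 1.0)
        # chain rule through the softmax: dJ/dz_i = pi_i * (g_i - <pi, g>)
        logits += lr * pi * (g - np.dot(pi, g))
    return softmax(logits)
```

With enough inner steps, the result matches the closed-form MD update `pi_k * exp(t_k * adv)` (normalized); truncating the inner loop too early leaves the iterate only partially solved, which is the paper's motivation for taking multiple SGD steps per MD iteration.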
On Enforcing a Hard Constraint
MDPO does not require enforcing a hard constraint (as in TRPO), yet performs better than TRPO in most tasks, while consistently beating PPO by a considerable margin across all tasks.
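The distinction can be summarized schematically: TRPO enforces the KL divergence as a hard constraint with trust-region radius \(\delta\), whereas MDPO treats it as a proximity penalty weighted by the step size (a sketch using the notation introduced above):

```latex
\text{TRPO:} \quad
\max_\theta \; \mathbb{E}\!\left[ \tfrac{\pi_\theta}{\pi_{\theta_k}} A^{\theta_k} \right]
\quad \text{s.t.} \quad
\mathbb{E}_s \big[ \mathrm{KL}(\pi_{\theta_k} \,\|\, \pi_\theta) \big] \le \delta

\text{MDPO:} \quad
\max_\theta \; \mathbb{E}\!\left[ \tfrac{\pi_\theta}{\pi_{\theta_k}} A^{\theta_k} \right]
- \tfrac{1}{t_k} \, \mathbb{E}_s \big[ \mathrm{KL}(\pi_\theta \,\|\, \pi_{\theta_k}) \big]
```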
On TRPO Performing Better Than PPO
Given the same code-level optimizations, TRPO performs better than PPO.
Code-level optimizations:
- Observation normalization
- Reward normalization
- Value function clipping
- Learning rate annealing
- Orthogonal initialization of NN weights
- Variance reduction using GAE
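As one example from this list, a common way to implement GAE-based variance reduction is a backward pass over one rollout, accumulating exponentially weighted TD errors (a sketch following Schulman et al.'s generalized advantage estimation; function and array names are ours):

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute generalized advantage estimates for one rollout.

    rewards[t]  : reward received at step t
    values[t]   : critic's value estimate V(s_t)
    last_value  : bootstrap value V(s_T) for the state after the rollout
    """
    T = len(rewards)
    adv = np.zeros(T)
    acc = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        # one-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # discounted, lambda-weighted sum of future TD errors
        acc = delta + gamma * lam * acc
        adv[t] = acc
        next_value = values[t]
    return adv
```

Setting `lam=0` recovers plain one-step TD errors (low variance, high bias), while `lam=1` recovers Monte Carlo returns minus the baseline (high variance, low bias); intermediate values trade the two off.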
[Figures: results without and with code-level optimizations]
On Close Connections to SAC
MDPO offers a more general view of SAC, while performing on par with or better than it.
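One way to see the connection: SAC's policy improvement step minimizes a KL divergence to the Boltzmann distribution over the soft Q-function, which can be read as a single MD-style proximal update toward that target (a schematic, with \(\tau\) the entropy temperature and \(Z(s)\) the normalizer):

```latex
\pi_{k+1} = \arg\min_{\pi} \;
\mathbb{E}_{s \sim \mathcal{D}} \left[
\mathrm{KL}\!\left( \pi(\cdot \mid s) \;\Big\|\;
\frac{\exp\!\big( Q^{\pi_k}(s, \cdot) / \tau \big)}{Z(s)} \right)
\right]
```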