Mirror Descent Policy Optimization

Manan Tomar, Lior Shani, Yonathan Efroni, Mohammad Ghavamzadeh

Paper | Code

Abstract

Mirror descent (MD), a well-known first-order method in constrained convex optimization, has recently been shown to be an important tool for analyzing trust-region algorithms in reinforcement learning (RL). Inspired by such theoretical analyses, we propose an efficient RL algorithm, called mirror descent policy optimization (MDPO). MDPO iteratively updates the policy by approximately solving a trust-region problem whose objective function consists of two terms: a linearization of the standard RL objective and a proximity term that restricts two consecutive policies to be close to each other. Each update approximately solves this problem by taking multiple gradient steps on its objective function. We derive on-policy and off-policy variants of MDPO, while emphasizing important design choices motivated by the existing theory of MD in RL.

Contributions


  • We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms, TRPO and PPO, and show that explicitly enforcing the trust-region constraint is, in fact, not necessary for the strong performance gains attributed to TRPO.


  • We then show how the popular soft actor-critic (SAC) algorithm can be derived by slight modifications of off-policy MDPO.


  • As a side result, we address another common belief: that PPO is a better-performing algorithm than TRPO. By reporting results for both a vanilla version and a version loaded with code-level optimization techniques for all algorithms, we show that, in both cases, TRPO consistently performs better than PPO.


  • Through our on-policy and off-policy experiments, we show that MDPO is a fundamental algorithm with high practical utility, achieving state-of-the-art performance across a number of benchmark tasks.

Algorithms

At each iteration k, MDPO approximately solves the following trust-region problem by performing multiple steps of SGD:

On-policy MDPO

$$\theta_{k+1} \leftarrow \arg\max_{\theta}\; \mathbb{E}_{s \sim \rho_{\theta_k}}\Big[\, \mathbb{E}_{a \sim \pi_{\theta}}\big[A^{\theta_k}(s,a)\big] \;-\; \tfrac{1}{t_k}\,\mathrm{KL}\big(s;\,\pi_{\theta},\,\pi_{\theta_k}\big) \Big]$$

where $\rho_{\theta_k}$ is the state distribution of the current policy $\pi_{\theta_k}$, $A^{\theta_k}$ is its advantage function, $\mathrm{KL}(s;\pi_{\theta},\pi_{\theta_k}) = \mathrm{KL}\big(\pi_{\theta}(\cdot\,|\,s)\,\|\,\pi_{\theta_k}(\cdot\,|\,s)\big)$, and $t_k$ is the mirror-descent step size.
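Concretely, the on-policy surrogate can be written as a loss over a batch collected with the current policy, with an importance-sampling ratio for the linearized term and a per-state KL for the proximity term. The sketch below is a minimal PyTorch-style illustration under assumed interfaces (the `policy.dist` method and the `batch` field names are placeholders); it is not the released implementation.

```python
import torch
from torch.distributions import kl_divergence

def onpolicy_mdpo_loss(policy, old_policy, batch, t_k):
    """Negative on-policy MDPO surrogate (so an SGD minimizer can be used).

    Assumes `policy.dist(states)` returns a torch Distribution over actions,
    and `batch` holds states, actions, and advantages A^{theta_k} computed
    from data collected with the previous policy pi_{theta_k}.
    """
    dist = policy.dist(batch["states"])
    with torch.no_grad():
        old_dist = old_policy.dist(batch["states"])
    # Importance-sampling ratio pi_theta / pi_{theta_k} for the linearized term.
    ratio = torch.exp(dist.log_prob(batch["actions"])
                      - old_dist.log_prob(batch["actions"]))
    policy_term = ratio * batch["advantages"]
    # Proximity term: KL(pi_theta(.|s) || pi_{theta_k}(.|s)) per state.
    kl = kl_divergence(dist, old_dist)
    return -(policy_term - (1.0 / t_k) * kl).mean()
```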
Off-policy MDPO

$$\theta_{k+1} \leftarrow \arg\max_{\theta}\; \mathbb{E}_{s \sim \mathcal{D}}\Big[\, \mathbb{E}_{a \sim \pi_{\theta}}\big[Q^{\theta_k}_{\phi}(s,a)\big] \;-\; \tfrac{1}{t_k}\,\mathrm{KL}\big(s;\,\pi_{\theta},\,\pi_{\theta_k}\big) \Big]$$

where $\mathcal{D}$ is a replay buffer and $Q^{\theta_k}_{\phi}$ is a learned critic.
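An analogous sketch for the off-policy actor update, again under assumed interfaces (`policy.dist` and `critic` are placeholders): states come from a replay buffer, actions are reparameterized samples from the current policy, and the learned critic replaces the on-policy advantage.

```python
import torch
from torch.distributions import kl_divergence

def offpolicy_mdpo_actor_loss(policy, old_policy, critic, states, t_k):
    """Negative off-policy MDPO actor surrogate over replay-buffer states.

    Assumes `policy.dist(states)` returns a reparameterizable distribution
    (e.g. a Gaussian) and `critic(states, actions)` is a learned Q-function.
    """
    dist = policy.dist(states)
    actions = dist.rsample()                       # reparameterization trick
    q_values = critic(states, actions).squeeze(-1) # shape: [batch]
    with torch.no_grad():
        old_dist = old_policy.dist(states)
    kl = kl_divergence(dist, old_dist)             # KL(pi_theta || pi_{theta_k})
    return -(q_values - (1.0 / t_k) * kl).mean()
```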
On Multiple SGD Steps

Taking multiple steps of SGD is crucial for approximately solving each MD iterate, and results in improved performance.
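The overall structure is a nested loop: each mirror-descent iteration freezes a copy of the policy, collects a batch, and then takes m SGD steps on the same surrogate, while the step size t_k is annealed across iterations. The following is a schematic sketch with placeholder names (pairing, for example, with the on-policy loss above), not the released training code.

```python
import copy

def train_mdpo(policy, optimizer, collect_batch, mdpo_loss,
               num_md_iters=1000, num_sgd_steps=10):
    """Schematic MDPO loop: K mirror-descent iterations, m SGD steps each."""
    for k in range(num_md_iters):
        old_policy = copy.deepcopy(policy)   # pi_{theta_k}, kept fixed
        batch = collect_batch(old_policy)    # data gathered with pi_{theta_k}
        t_k = 1.0 - k / num_md_iters         # one simple annealing schedule
        for _ in range(num_sgd_steps):       # approximately solve the MD iterate
            loss = mdpo_loss(policy, old_policy, batch, t_k)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```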

On Enforcing a Hard Constraint

MDPO does not require enforcing a hard constraint (as TRPO does), yet it still performs better than TRPO on most tasks, while consistently beating PPO by a considerable margin across all tasks.
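Put side by side, the two updates differ mainly in where the KL term sits (and in the direction of the KL). The following is a schematic comparison in the notation of the Algorithms section:

```latex
% TRPO: the KL appears as a hard constraint, handled with a
% conjugate-gradient step plus a backtracking line search.
\max_{\theta}\ \mathbb{E}_{s\sim\rho_{\theta_k}}\,\mathbb{E}_{a\sim\pi_{\theta}}
  \big[A^{\theta_k}(s,a)\big]
\quad\text{s.t.}\quad
\mathbb{E}_{s\sim\rho_{\theta_k}}\big[\mathrm{KL}(s;\pi_{\theta_k},\pi_{\theta})\big]\le\delta

% MDPO: the KL is kept inside the objective as a penalty weighted by 1/t_k,
% and the problem is approximately solved with plain SGD.
\max_{\theta}\ \mathbb{E}_{s\sim\rho_{\theta_k}}\Big[\mathbb{E}_{a\sim\pi_{\theta}}
  \big[A^{\theta_k}(s,a)\big]-\tfrac{1}{t_k}\,\mathrm{KL}(s;\pi_{\theta},\pi_{\theta_k})\Big]
```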

On TRPO Performing Better Than PPO

Given the same code-level optimizations, TRPO is a better-performing algorithm than PPO.


Code-level optimizations:


  • Observation normalization

  • Reward normalization

  • Value function clipping

  • Learning rate annealing

  • Orthogonal initialization of NN weights

  • Variance reduction using GAE (sketched below)
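As an illustration of one of these optimizations, below is a minimal NumPy sketch of advantage estimation with GAE; the function name and arguments are placeholders, not taken from the released code.

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2016).

    `values` has length T + 1 (a bootstrap value is appended); `dones`
    flags episode ends so advantages do not bootstrap across episodes.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    last_gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])
        # One-step TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Exponentially weighted sum of TD errors with decay gamma * lambda.
        last_gae = delta + gamma * lam * nonterminal * last_gae
        advantages[t] = last_gae
    return advantages
```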

Without code-level optimizations (figure)

With code-level optimizations (figure)

On Close Connections to SAC

MDPO offers a more general view of SAC, while performing on par with or better than it.
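One way to see the connection is through SAC's standard policy-improvement step, which projects the policy onto the Boltzmann distribution induced by the soft Q-function. The block below is a sketch in standard SAC notation (Haarnoja et al., 2018), not the authors' exact derivation:

```latex
% SAC policy improvement: KL projection onto the soft-Q Boltzmann distribution
\pi_{k+1} \in \arg\min_{\pi}\ \mathbb{E}_{s\sim\mathcal{D}}\!\left[
  \mathrm{KL}\!\Big(\pi(\cdot\,|\,s)\ \Big\|\
  \frac{\exp\!\big(Q^{\pi_k}(s,\cdot)/\alpha\big)}{Z(s)}\Big)\right]
```

Expanding this KL shows that SAC maximizes $\mathbb{E}_{a\sim\pi}[Q(s,a)] + \alpha\,\mathcal{H}(\pi(\cdot|s))$; since the entropy is, up to a constant, the negative KL to a uniform policy, SAC can be read as the off-policy MDPO update with the previous policy $\pi_{\theta_k}$ in the proximity term replaced by a fixed uniform policy.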

More details are available on the GitHub page.

Contact

If you have any questions, problems, or suggestions for improvement, you can reach us via email or raise an issue on GitHub.