## Policy Optimization

in Reinforcement Learning

*by **Sham Kakade**, **Martha White**, **Nicolas Le Roux*

*with help from **Alan Chan**, **Shivam Garg**, **Dhawal Gupta**, and **Abhishek Naik*

### presented at the **Conference on Neural Information Processing Systems (Neur****IPS) 2020**

**Conference on Neural Information Processing Systems (Neur**

**IPS) 2020**## Overview

This tutorial will cover policy gradients methods in reinforcement learning, with a focus on understanding foundational ideas from an optimization perspective. We will discuss the properties of the policy objective, in terms of two critical properties for convergence rates when using stochastic gradient approaches: variance and curvature. We will explain how the policy objective can be a particularly difficult optimization problem, as it can have large flat regions and stochastic samples of the gradient can be very high variance. We will first explain how to use standard tools from optimization to reduce the variance of the gradient estimate, as well as techniques to mitigate curvature issues. We will then discuss optimization improvements that leverage more knowledge about the objective, including the Markov property and how to modify the state distribution for more coverage. We will discuss how standard Actor-Critic methods with (off-policy) data re-use provide RL-specific variance reduction approaches. We will then conclude with an overview of what is known theoretically about the policy objective, where we discuss the role of entropy-regularization and exploration for mitigating curvature issues.

## Target Audience

The tutorial is for those keen on understanding how to learn policies in reinforcement learning, by applying analysis tools from optimization. The tutorial assumes only a basic knowledge of optimization concepts and reinforcement learning concepts, including concepts like matrices and vectors; objective functions, convexity and gradient descent; MDPs, policies and value functions.

## Learning Objectives

The goal of this tutorial is to provide attendees with a deeper understanding of the behaviour of common reinforcement learning algorithms and why commonly used tricks are efficient. We hope to imbue them with an intuition about the right algorithm to use based on the problem they are trying to solve and give them the tools to analyze the learning dynamics they might observe when training an agent.

## Outline and Schedule

The tutorial is on **Monday, December 7 from 7:00 pm - 9:30 pm UTC **(11 am Pacific Time and 2 pm EST). Here is the link to the tutorial on the NeurIPS website.

Nicolas will first introduce the policy objective, strategies to optimize the objective and common issues for optimization problems: curvature and variance. Martha will then talk about specific variance reduction strategies for the policy objective, and then Sham will talk about curvature issues for the policy objective.

All three videos are pre-recorded and will be streamed on Monday, December 7 at 7:00 pm UTC. After each video, we will have a live Q&A for 10 minutes. The entire team (Sham, Martha, Nicolas, Alan, Shivam, Dhawal, Abhishek) will also be answering questions on chat, while videos are being streamed.

**7:00 pm - 7:40 pm UTC**: Part1, Nicolas's lecture [slides] [video]**Title: RL as Black-box Optimization****7:40 pm - 7:50 pm UTC**: Live Q&A with Nicolas, Sham and Martha**7:50 pm - 8:30 pm UTC**: Part2, Martha's lecture [slides] [video]**Title: Reducing Variance and the Connection to Value-Based Methods****8:30 pm - 8:40 pm UTC**: Live Q&A with Martha, Nicolas and Sham**8:40 pm - 9:20 pm UTC**: Part3, Sham's lecture [slides] [video]**Title: Theoretical Foundations****Part 3 Readings:**Reinforcement Learning: Theory and Algorithms, (AJKS). See Section 3.

**9:20 pm - 9:30 pm UTC**: Live Q&A with Sham, Martha and Nicolas

All lectures will be available afterwards as well, and we will continue to answer more questions on chat.

An additional live Q&A will be held on **Thursday, December 10 from 9:00 pm - 9:50 pm UTC (**1 pm Pacific Time and 4 pm EST).

We have created Colab notebooks for some hands-on learning in this tutorial. There are available now and can be done at any time. Feel free to use chat to discuss them and to ask questions about them.

## Notebooks

We have four notebooks: two demonstrating curvature issues, one highlighting the utility of variance reduction with baselines and one highlighting the balance between variance and bias using n step returns.

Baselines in Policy Gradient Methods (link)

N Step Returns for Bias - Variance Tradeoff (link)

Part3a: Flat Gradients (link)

Part3b-e: Curvature and an "Anti Shaping" Example (link)

**Working with the Notebooks**

Open the links logged into your google accounts.

Under [File] > [Save a copy in Drive]: Save a copy of the notebook in your drive if you want to save any changes you want to make on the notebook (you can still run code without saving a copy).

Run code (there will be prompt about authorship which you can ignore).

## Bios

Sham Kakade is a professor at the University of Washington and a researcher at MSR. He works on the mathematical foundations of machine learning and AI. Sham's thesis helped in laying the statistical foundations of reinforcement learning. With his collaborators, his additional contributions include: one of the first provably efficient policy search methods, Conservative Policy Iteration, for reinforcement learning; developing the mathematical foundations for the widely used linear bandit models and the Gaussian process bandit models; the tensor and spectral methodologies for provable estimation of latent variable models; the first sharp analysis of the perturbed gradient descent algorithm, along with the design and analysis of numerous other convex and non-convex algorithms. He is the recipient of the ICML Test of Time Award (2020), the IBM Pat Goldberg best paper award (in 2007), INFORMS Revenue Management and Pricing Prize (2014). He has been program chair for COLT 2011.

Martha White is an Associate Professor in the Department of Computing Science at the University of Alberta and a PI of the Alberta Machine Intelligence Institute (Amii). She holds a Canada CIFAR AI Chair. Martha is an associate editor for TPAMI, and has served as co-program chair for ICLR and area chair for many conferences in AI and ML, including ICML, NeurIPS, AAAI and IJCAI. Her research focus is on developing algorithms for agents continually learning on streams of data, with an emphasis on representation learning and reinforcement learning.

Nicolas Le Roux is a researcher at Google AI, a Canada CIFAR AI Chair and an Adjunct Professor at McGill and the University of Montreal. He is also a co-recipient of the 2018 Lagrange Prize in Continuous Optimization for his work on stochastic variance reduction methods.

Nicolas serves as an associated editor for TPAMI, as well as area chair for conferences such as ICML, NeurIPS or ICLR.

His research focuses on the efficient use of data to improve the convergence speed of stochastic optimization methods in supervised learning and reinforcement learning.

Alan Chan is a first-year PhD student at Mila with Nicolas Le Roux; previously, he completed his MSc at the University of Alberta with Martha White. His current research interests include (1) understanding and developing effective control algorithms in reinforcement learning, (2) governance to ensure the just development and use of AI, and (3) the problem of value alignment.

Shivam Garg is a second year MSc student working with Rupam Mahmood and Martha White at the University of Alberta. His main interest is in artificial intelligence, particularly reinforcement learning, and its applications. Within RL, he is most excited about policy gradient methods and temporal difference learning.

Dhawal Gupta is a masters student at the University of Alberta working with Martha White. His current research interest involve (1) developing localized learning principles for decentralized training of function approximators (2) variance reduction strategies for policy gradient methods, and (3) developing practical off-policy algorithms.

Abhishek Naik is a PhD student working with Rich Sutton at the University of Alberta. He wants to discover the general principles that underlie goal-seeking behaviour. Currently, his research focuses on learning and planning methods for continuing (non-episodic) problems. He also loves reading about the connections between AI, neuroscience, and cognitive psychology.

## Reading List

Here we highlight a few specific readings, particularly connected to what is given in the tutorial. These reference the larger Resource List below.

### Action-value function (Q) and the value function (V)

The definition of the two functions and some of their properties can be found in Chapter 3.5 of the book Reinforcement learning (2nd edition).

### Convergence results of gradient descent

The convergence of the squared norm of the gradient in the nonconvex setting can be found in Stochastic First- and Zeroth-order methods for nonconvex stochastic programming, Eq. 2.14. The result uses decreasing stepsizes and makes the assumption of bounded gradient noise as well as unbiased gradients (Assumption A1 in the statement of Corollary 2.2).

Further results on the convergence of stochastic gradient methods for both convex and nonconvex objectives, in particular when using biased estimates of the gradient, can be found in Section 4 of Optimization Methods for Large-Scale Machine Learning.

A very nice description of mirror descent can be found in Chapter 4 of Convex Optimization: Algorithms and Complexity.

### Regularization techniques

A discussion of the impact of various regularization techniques can be found in section 10.3 (" Log Barrier Regularization") of Reinforcement Learning: Theory and Algorithms. An empirical analysis of the impact of entropy regularization can be found in Understanding the Impact of Entropy on Policy Optimization.

### Variance reduction and baselines

Variance-Reduced Methods for Machine Learning reviews the history of stochastic variance reduction methods. The most explicit use of control variates is done by SVRG and SAGA.

To see a simple explanation of why the baseline does not bias the gradient, see Section 8.2.2 in Reinforcement Learning: Theory and Algorithms. For an explicit derivation for the optimal baseline, in the notation used for this tutorial, see this (mostly legible) write-up.

Part 2 also concludes with some comments that there is more to the story than simply reducing variance, without introducing bias. One such nuance is recent work showing that baselines can play an alternative role in the optimization that variance reduction. The baselines notebook goes into this in more depth, following this insight from “Beyond variance reduction: Understanding the true impact of baselines on policy optimization”.

### Off Policy Readings

Part 2 of this tutorial alludes to several off-policy learning concepts, for both policy evaluation and policy gradient methods. For an overview of off-policy methods for policy evaluation, and to better understand gradient TD methods and prior corrections, see “Online Off-policy Prediction”. To understand nuances when computing policy gradients off-policy, particularly related to comments about the state weighting, see “An Off-policy Policy Gradient Theorem Using Emphatic Weightings”.

### Theoretical Foundations

Much of the third part of the tutorial, which focuses on curvature issues and related theory, relies on the monograph Reinforcement Learning: Theory and Algorithms, by Agarwal, Jiang, Kakade and Sun. In particular, see Section 3**, **which is an excellent companion to Part 3 of this tutorial, and also provides a good resource for much of this tutorial.

### Seminal Policy Gradient Readings

"Policy Gradient Methods for Reinforcement Learning with Function Approximation", Sutton et al., 2000

"Approximately optimal approximate reinforcement learning", Kakade and Langford, ICML, 2002

### Biased Policy Gradients

A good argument for using a lower bound (biased gradient): (Ghosh et al., 2020) “An operator view of policy gradient methods”, Proposition 6

Example showing poor performance with biased gradient in the on-policy setting: (Nota and Thomas, 2020) "Is the Policy Gradient a Gradient?"

Counterexample showing poor performance with biased gradient in the off-policy setting: (Imani et al., 2018) “An Off-policy Policy Gradient Theorem Using Emphatic Weightings”, Section 5.1

## Resource List

A more comprehensive set of references. This includes some works not explicitly referenced above, but that are pertinent to this topic. We include a reverse reference, where pertinent, to refer to how some of these references are used above.

(Agarwal et al, 2020) Reinforcement Learning: Theory and Algorithms [General reading]

(Bottou et al., 2018) Optimization Methods for Large-Scale Machine Learning [General reading]

(Chung et al., 2020) “Beyond variance reduction: Understanding the true impact of baselines on policy optimization” [Baselines]

(Ghiassian et al., 2018) “Online Off-policy Prediction” [Off-policy readings]

(Ghosh et al., 2020) “An operator view of policy gradient methods” [Biased gradients]

(Greensmith et al., 2004) “Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning” [Baselines]

(Imani et al., 2018) “An Off-policy Policy Gradient Theorem Using Emphatic Weightings” [Off-policy readings, Biased gradients]

(Kakade and Langford, 2002) "Approximately optimal approximate reinforcement learning" [Seminal reading]

(Nota and Thomas, 2020) "Is the Policy Gradient a Gradient?" [Biased gradients]

(Peters and Schaal, 2008) "Reinforcement learning of motor skills with policy gradients"

(Sutton et al, 2000) "Policy Gradient Methods for Reinforcement Learning with Function Approximation" [Seminal reading]