Residual Q-Learning: Offline and Online Policy Customization without Value
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023
Chenran Li1*, Chen Tang1*, Haruki Nishimura2, Jean Mercat2, Masayoshi Tomizuka1, Wei Zhan1
1 University of California, Berkeley 2 Toyota Research Institute, USA
* Equal Contribution
Abstract
Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations. It is especially appealing for solving complex real-world tasks where handcrafting a reward function is difficult, or when the goal is to mimic human expert behavior. However, the learned imitative policy can only follow the behavior in the demonstration. When applying the imitative policy, we may need to customize the policy behavior to meet different requirements coming from diverse downstream tasks, while still keeping the customized policy imitative. To this end, we formulate a new problem setting called policy customization. It defines the learning task as training a policy that inherits the characteristics of the prior policy while satisfying some additional requirements imposed by a target downstream task. We propose a novel and principled approach to interpret and determine the trade-off between the two task objectives. Specifically, we formulate the customization problem as a Markov Decision Process (MDP) with a reward function that combines 1) the inherent reward of the demonstration; and 2) the add-on reward specified by the downstream task. We propose a novel framework, Residual Q-learning, which can solve the formulated MDP by leveraging the prior policy without knowing the inherent reward or value function of the prior policy. We derive a family of residual Q-learning algorithms that can realize offline and online policy customization, and show that the proposed algorithms can effectively accomplish policy customization tasks in various environments.
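The key idea, that the prior policy's log-probabilities can stand in for its unknown reward and value inside a soft Bellman backup, can be illustrated with a tabular sketch. Everything below (function names, the toy hyperparameters, and the restriction to finite states and actions) is our own illustrative assumption, not the paper's implementation: we assume the prior is a max-entropy policy with temperature alpha, so omega * alpha * log pi(a|s) equals omega * Q(s, a) up to an action-independent offset that cancels in the policy.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(x - m).sum(axis=axis))

def residual_soft_q_iteration(P, r_add, log_pi_prior, gamma=0.99,
                              omega=1.0, alpha=1.0, alpha_prime=1.0,
                              n_iters=500):
    """Tabular residual soft Q-iteration (illustrative sketch).

    P:            (S, A, S) transition probabilities.
    r_add:        (S, A) add-on reward from the downstream task.
    log_pi_prior: (S, A) log-probabilities of the prior policy.

    Note that the prior's inherent reward never appears: only its
    log-probabilities enter the backup.
    """
    q_r = np.zeros_like(r_add)
    for _ in range(n_iters):
        # Soft value of the combined (customized) policy.
        logits = (q_r + omega * alpha * log_pi_prior) / alpha_prime
        v = alpha_prime * logsumexp(logits, axis=1)   # (S,)
        q_r = r_add + gamma * P @ v                   # (S, A)
    # Customized policy: softmax over residual Q plus prior log-probs.
    logits = (q_r + omega * alpha * log_pi_prior) / alpha_prime
    pi = np.exp(logits - logsumexp(logits, axis=1)[:, None])
    return q_r, pi
```

Setting omega large recovers the prior behavior; setting omega to zero reduces the sketch to plain soft Q-iteration on the add-on reward alone.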
Experiments
We evaluate the proposed algorithms and present demos for four environments selected from different domains: Cart Pole and Continuous Mountain Car from the OpenAI Gym classic-control suite, and Highway and Parking from the highway-env environments. For each environment, we mainly present demos of the following policies:
RL prior policy: We train an RL policy optimizing the basic reward. It serves as the prior policy for policy customization.
IL prior policy: We train another IL prior policy that imitates the RL prior with GAIL. The IL prior policy serves as a baseline for comparison with the residual-Q policy customized from the IL prior policy.
Residual-Q policies (Ours): In each environment, we train two residual-Q customized policies leveraging the RL and IL prior policies, respectively. They are compared against the corresponding prior policies to validate their effectiveness.
Cart Pole
In this environment, the goal of the basic task is to balance the pole by exerting forces on the cart, while the add-on task requires the cart to stay at the center of the rack. Compared with the prior policies, the customized policies are able to keep the cart at the center of the rack.
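One way to encode such an add-on task is a simple position penalty layered on top of the basic balancing reward. The function name and coefficient below are illustrative assumptions, not the exact reward used in the demos:

```python
def cartpole_addon_reward(observation, center_weight=0.5):
    """Illustrative add-on reward for the 'stay at center' task.

    CartPole observations are [cart position, cart velocity,
    pole angle, pole angular velocity]; penalizing |position|
    pulls the customized policy toward the center of the rack.
    The coefficient is an arbitrary illustrative choice.
    """
    cart_position = observation[0]
    return -center_weight * abs(cart_position)
```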
RL Prior Policy
Basic Task:
Balance the pole by controlling the cart.
IL Prior Policy
RL Customized
Add-on Task:
Keep the cart at the center of the rack.
IL Customized
Mountain Car
In this environment, the goal of the basic task is to accelerate the car to reach the goal state on top of the right hill with the least energy consumption, while the add-on preference is to avoid negative actions whenever possible. As shown by the number of negative actions used in each episode, the customized policies are able to reduce the usage of negative actions.
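A preference like this can be expressed as a penalty on leftward force. The penalty magnitude below is an illustrative assumption, not the value used in the demos:

```python
def mountaincar_addon_reward(action, penalty=1.0):
    """Penalize negative (leftward) force in Continuous Mountain Car.

    action: 1-D array-like with a single force value in [-1, 1].
    Any negative force incurs a fixed penalty (illustrative choice).
    """
    return -penalty if action[0] < 0.0 else 0.0
```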
RL Prior Policy
Basic Task:
Accelerate the car to reach the top of the right hill with the least energy consumption.
IL Prior Policy
RL Customized
Add-on Task:
Avoid negative actions whenever possible.
IL Customized
Highway
In this environment, the goal of the basic task is to drive the vehicle safely and efficiently through the traffic on a three-lane highway around other vehicles. During policy customization, we enforce an additional preference to stay on the rightmost lane whenever possible. After being customized via residual Q-learning, the resulting policies keep the agent driving on the rightmost lane without forgetting the behavior inherited from the prior policy, such as overtaking other vehicles.
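A lane preference like this can be encoded as a bonus for occupying the target lane. The lane-indexing convention and the weight below are our own assumptions for illustration, not the reward used in the demos:

```python
def highway_addon_reward(lane_index, num_lanes=3, weight=0.5):
    """Illustrative bonus for staying on the rightmost lane.

    Assumes lane indices run 0..num_lanes-1 with the largest index
    being the rightmost lane (an assumption about the environment's
    convention). The weight is an arbitrary illustrative choice.
    """
    return weight * float(lane_index == num_lanes - 1)
```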
RL Prior Policy
Basic Task:
Drive the vehicle through the traffic safely and efficiently.
IL Prior Policy
RL Customized
Add-on Task:
Stay on the rightmost lane if possible.
IL Customized
Parking
In this environment, the goal of the basic task is to park the vehicle at the target parking space within a minimal number of time steps. During policy customization, we impose an additional requirement to avoid touching the boundaries of the parking slots during parking. As can be seen, when using the RL prior policy, the proposed method completely changes the parking trajectory to avoid touching the boundaries. Meanwhile, the IL prior policy fails to stop the car within the target parking slot, whereas the customized policy succeeds with the guidance of the boundary-violation constraint.
RL Prior Policy
Basic Task:
Park the car in the target parking slot.
IL Prior Policy
RL Customized
Add-on Task:
Avoid touching the slot boundaries.
IL Customized
Citation
@inproceedings{li2023residual,
title={Residual Q-Learning: Offline and Online Policy Customization without Value},
author={Li, Chenran and Tang, Chen and Nishimura, Haruki and Mercat, Jean and Tomizuka, Masayoshi and Zhan, Wei},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023}
}