Residual Q-Learning: Offline and Online Policy Customization without Value
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023
Chenran Li1*, Chen Tang1*, Haruki Nishimura2, Jean Mercat2, Masayoshi Tomizuka1, Wei Zhan1
1 University of California, Berkeley 2 Toyota Research Institute, USA
* Equal Contribution
Abstract
Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations. It is especially appealing for solving complex real-world tasks where handcrafting a reward function is difficult, or when the goal is to mimic human expert behavior. However, the learned imitative policy can only follow the behavior in the demonstration. When applying the imitative policy, we may need to customize the policy behavior to meet different requirements coming from diverse downstream tasks, while still keeping the customized policy imitative. To this end, we formulate a new problem setting called policy customization. It defines the learning task as training a policy that inherits the characteristics of the prior policy while satisfying some additional requirements imposed by a target downstream task. We propose a novel and principled approach to interpret and determine the trade-off between the two task objectives. Specifically, we formulate the customization problem as a Markov Decision Process (MDP) with a reward function that combines 1) the inherent reward of the demonstration; and 2) the add-on reward specified by the downstream task. We propose a novel framework, Residual Q-learning, which can solve the formulated MDP by leveraging the prior policy without knowing the inherent reward or value function of the prior policy. We derive a family of residual Q-learning algorithms that can realize offline and online policy customization, and show that the proposed algorithms can effectively accomplish policy customization tasks in various environments.
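The key idea, that the prior policy's log-probabilities can stand in for its unknown reward and value inside a soft Bellman backup, can be illustrated with a tabular sketch. Everything below (function names, the toy hyperparameters, and the restriction to finite states and actions) is our own illustrative assumption, not the paper's implementation: we assume the prior is a max-entropy policy with temperature alpha, so omega * alpha * log pi(a|s) equals omega * Q(s, a) up to an action-independent offset that cancels in the policy.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(x - m).sum(axis=axis))

def residual_soft_q_iteration(P, r_add, log_pi_prior, gamma=0.99,
                              omega=1.0, alpha=1.0, alpha_prime=1.0,
                              n_iters=500):
    """Tabular residual soft Q-iteration (illustrative sketch).

    P:            (S, A, S) transition probabilities.
    r_add:        (S, A) add-on reward from the downstream task.
    log_pi_prior: (S, A) log-probabilities of the prior policy.

    Note that the prior's inherent reward never appears: only its
    log-probabilities enter the backup.
    """
    q_r = np.zeros_like(r_add)
    for _ in range(n_iters):
        # Soft value of the combined (customized) policy.
        logits = (q_r + omega * alpha * log_pi_prior) / alpha_prime
        v = alpha_prime * logsumexp(logits, axis=1)   # (S,)
        q_r = r_add + gamma * P @ v                   # (S, A)
    # Customized policy: softmax over residual Q plus prior log-probs.
    logits = (q_r + omega * alpha * log_pi_prior) / alpha_prime
    pi = np.exp(logits - logsumexp(logits, axis=1)[:, None])
    return q_r, pi
```

Setting omega large recovers the prior behavior; setting omega to zero reduces the sketch to plain soft Q-iteration on the add-on reward alone.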
Experiments
We evaluate the proposed algorithms and present demos for four environments selected from different domains: Cart Pole and Continuous Mountain Car from the OpenAI Gym classic-control suite, and Highway and Parking from the highway-env environments. For each environment, we mainly present demos of the following policies:
RL prior policy: We train an RL policy optimizing the basic reward. It serves as the prior policy for policy customization.
IL prior policy: We train another IL prior policy that imitates the RL prior with GAIL. The IL prior policy serves as a baseline for comparison with the residual-Q policy customized from the IL prior policy.
Residual-Q policies (Ours): In each environment, we train two residual-Q customized policies leveraging the RL and IL prior policies, respectively. They are compared against the corresponding prior policies to validate their effectiveness.
Cart Pole
In this environment, the goal of the basic task is to balance the pole by exerting forces on the cart, while the add-on task requires the cart to stay at the center of the rack. Compared with the prior policies, the customized policies are able to keep the cart at the center of the rack.
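One way to encode such an add-on task is a simple position penalty layered on top of the basic balancing reward. The function name and coefficient below are illustrative assumptions, not the exact reward used in the demos:

```python
def cartpole_addon_reward(observation, center_weight=0.5):
    """Illustrative add-on reward for the 'stay at center' task.

    CartPole observations are [cart position, cart velocity,
    pole angle, pole angular velocity]; penalizing |position|
    pulls the customized policy toward the center of the rack.
    The coefficient is an arbitrary illustrative choice.
    """
    cart_position = observation[0]
    return -center_weight * abs(cart_position)
```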
RL Prior Policy
Basic Task:
Balance the pole by controlling the cart.
IL Prior Policy
RL Customized
Add-on Task:
Keep the cart at the center of the rack.
IL Customized
Mountain Car
In this environment, the goal of the basic task is to accelerate the car to reach the goal state on top of the right hill with the least energy consumption, while the add-on preference is to avoid negative actions whenever possible. As shown by the number of negative actions used in each episode, the customized policies are able to reduce the usage of negative actions.
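A preference like this can be expressed as a penalty on leftward force. The penalty magnitude below is an illustrative assumption, not the value used in the demos:

```python
def mountaincar_addon_reward(action, penalty=1.0):
    """Penalize negative (leftward) force in Continuous Mountain Car.

    action: 1-D array-like with a single force value in [-1, 1].
    Any negative force incurs a fixed penalty (illustrative choice).
    """
    return -penalty if action[0] < 0.0 else 0.0
```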
RL Prior Policy
Basic Task:
Accelerate the car to reach the top of the right hill with the least energy consumption.
IL Prior Policy
RL Customized
Add-on Task:
Avoid negative actions whenever possible.
IL Customized
Highway
In this environment, the goal of the basic task is to drive the vehicle safely and efficiently through the traffic on a three-lane highway around other vehicles. During policy customization, we enforce an additional preference to stay on the rightmost lane whenever possible. After being customized via residual Q-learning, the resulting policies keep the agent driving on the rightmost lane without forgetting the behavior inherited from the prior policy, such as overtaking other vehicles.
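A lane preference like this can be encoded as a bonus for occupying the target lane. The lane-indexing convention and the weight below are our own assumptions for illustration, not the reward used in the demos:

```python
def highway_addon_reward(lane_index, num_lanes=3, weight=0.5):
    """Illustrative bonus for staying on the rightmost lane.

    Assumes lane indices run 0..num_lanes-1 with the largest index
    being the rightmost lane (an assumption about the environment's
    convention). The weight is an arbitrary illustrative choice.
    """
    return weight * float(lane_index == num_lanes - 1)
```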
RL Prior Policy
Basic Task:
Drive the vehicle through the traffic safely and efficiently.
IL Prior Policy
RL Customized
Add-on Task:
Stay on the rightmost lane if possible.
IL Customized
Parking
In this environment, the goal of the basic task is to park the vehicle at the target parking space within a minimal number of time steps. During policy customization, we impose an additional requirement to avoid touching the boundaries of the parking slots during parking. As can be seen, when using the RL prior policy, the proposed method completely changes the parking trajectory to avoid touching the boundaries. Meanwhile, the IL prior policy fails to stop the car within the target parking slot, whereas the customized policy succeeds with the guidance of the boundary-violation constraint.
RL Prior Policy
Basic Task:
Park the car in the target parking slot.
IL Prior Policy
RL Customized
Add-on Task:
Avoid touching the slot boundaries.
IL Customized
Citation
@inproceedings{li2023residual,
title={Residual Q-Learning: Offline and Online Policy Customization without Value},
author={Li, Chenran and Tang, Chen and Nishimura, Haruki and Mercat, Jean and Tomizuka, Masayoshi and Zhan, Wei},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023}
}