Dynamics-Aware Comparison of Learned Reward Functions
ICLR 2022
Blake Wulfe, Ashwin Balakrishna, Logan Ellis, Jean Mercat, Rowan McAllister, Adrien Gaidon
Abstract
The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world. However, comparing reward functions, for example as a means of evaluating reward learning methods, presents a challenge. Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it. To address this challenge, Gleave et al. (2020) proposed the Equivalent-Policy Invariant Comparison (EPIC) distance. EPIC avoids policy optimization, but in doing so requires computing reward values at transitions that may be impossible under the system dynamics. This is problematic for learned reward functions because it entails evaluating them outside their training distribution, resulting in inaccurate reward values that we show can render EPIC ineffective at comparing rewards. To address this problem, we propose Dynamics-Aware Reward Distance (DARD), a new reward pseudometric. DARD uses an approximate transition model of the environment to transform reward functions into a form that allows for comparisons that are invariant to reward shaping while only evaluating reward functions on transitions close to their training distribution. Experiments in simulated physical domains demonstrate that DARD enables reliable reward comparisons without policy optimization and is significantly more predictive of downstream policy performance than baseline methods when dealing with learned reward functions.
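Both EPIC and DARD ultimately score a pair of reward functions by canonicalizing them and then taking the Pearson distance between their values on a batch of transitions drawn from a coverage distribution; the two methods differ in how the canonicalization is performed. As a point of reference, here is a minimal sketch of that final scoring step, where `rewards_a` and `rewards_b` denote the canonicalized reward values of the two functions on the same transitions:

```python
import numpy as np

def pearson_reward_distance(rewards_a, rewards_b):
    """Pearson distance d = sqrt((1 - rho) / 2) between two arrays of
    (canonicalized) reward values evaluated on the same batch of transitions.
    """
    rho = np.corrcoef(rewards_a, rewards_b)[0, 1]
    return np.sqrt(0.5 * (1.0 - rho))
```

The resulting distance lies in [0, 1] and is zero exactly when the two canonicalized rewards are perfectly positively correlated, which is why the canonicalization step is what makes the comparison invariant to reward shaping.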
Method
DARD uses a transition model to transform reward functions into a form that enables reward-shaping-invariant comparisons without requiring out-of-distribution evaluations, which can yield arbitrary values for learned reward functions. The figure to the right visualizes this transformation in a simple MDP (see the paper for details).
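As an illustrative sketch only: given sample access to an approximate transition model, a canonicalization in the spirit of DARD can be estimated by Monte Carlo as below. The callables `reward_fn`, `transition_model`, and `sample_action` are hypothetical stand-ins (the exact expectations and sampling scheme are defined in the paper); the point of the construction is that every reward query stays within one model step of the observed states.

```python
def dard_canonicalize(reward_fn, transition_model, sample_action,
                      s, a, s_next, gamma, n_samples=64):
    """Monte Carlo sketch of a dynamics-aware reward canonicalization.

    reward_fn(s, a, s')    -> scalar reward (hypothetical interface)
    transition_model(s, a) -> sampled next state from the approximate dynamics
    sample_action()        -> action drawn from the action distribution D_A

    Every reward query below involves states at most one model step away from
    the observed s and s_next, so a learned reward function is only evaluated
    near its training distribution.
    """
    correction = 0.0
    for _ in range(n_samples):
        a_k = sample_action()                 # A ~ D_A
        s1 = transition_model(s, a_k)         # S'  ~ T(. | s, A)
        s2 = transition_model(s_next, a_k)    # S'' ~ T(. | s', A)
        correction += (gamma * reward_fn(s_next, a_k, s2)   # + gamma * R(s', A, S'')
                       - reward_fn(s, a_k, s1)              # - R(s, A, S')
                       - gamma * reward_fn(s1, a_k, s2))    # - gamma * R(S', A, S'')
    return reward_fn(s, a, s_next) + correction / n_samples
```

Under this construction, the contribution of a potential-based shaping term γΦ(s') − Φ(s) cancels, so two rewards that differ only by shaping map to the same canonical form and have Pearson distance zero.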
Policy and Reward Function Visualizations
The video to the left shows example demonstrations of policies in the bouncing balls environment. Each policy is trained to optimize one of the manually defined or learned reward functions, and each demonstration is preceded by a title card indicating which reward function the policy was trained on (see the video description for timestamps).
This environment consists of an agent (blue) that attempts to reach a goal location (green) while avoiding a set of other agents (black) that are moving around the scene randomly. This environment is intended to capture elements of physical, multi-agent environments (e.g., autonomous driving).
This video shows visualizations of the learned and manually-defined reward functions in the bouncing balls environment. These heatmaps are generated by moving the position of the agent around the scene and computing the reward at each location. The reward labels correspond to the reward functions learned in the paper.
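One way such a heatmap can be produced is sketched below, assuming a `reward_fn` that scores a state and a hypothetical helper `set_agent_position(state, x, y)` that returns a copy of the scene with the agent moved to (x, y) while the goal and the other agents are held fixed:

```python
import numpy as np
import matplotlib.pyplot as plt

def reward_heatmap(reward_fn, base_state, set_agent_position,
                   x_range=(0.0, 1.0), y_range=(0.0, 1.0), resolution=100):
    """Evaluate a reward function over a grid of agent positions and plot it.

    base_state holds the rest of the scene fixed; set_agent_position is a
    hypothetical helper that places the agent at the queried (x, y) location.
    """
    xs = np.linspace(*x_range, resolution)
    ys = np.linspace(*y_range, resolution)
    values = np.zeros((resolution, resolution))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            values[i, j] = reward_fn(set_agent_position(base_state, x, y))
    plt.imshow(values, origin="lower", extent=[*x_range, *y_range],
               cmap="viridis")
    plt.colorbar(label="reward")
    plt.xlabel("x position")
    plt.ylabel("y position")
    return values
```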
The video to the left shows example demonstrations of policies in the Reacher environment. Each policy is trained to optimize one of the manually defined or learned reward functions, and each demonstration is preceded by a title card indicating which reward function the policy was trained on (see the video description for timestamps).
This environment consists of a robotic arm that attempts to move its end-effector to a goal location. This environment is intended to capture elements of robotic manipulation tasks.
Learned Transition Model Visualizations
The video to the right visualizes a learned transition model for the Reacher environment on a demonstration from a random policy. The ground-truth and predicted next states are shown slightly faded out; in general they overlap almost perfectly, though they diverge in some instances (e.g., during high-velocity transitions).
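The divergence seen in the video can be quantified with a simple one-step prediction check along a rollout; the sketch below assumes a classic Gym-style reset()/step() interface and hypothetical `policy` and `transition_model` callables:

```python
import numpy as np

def one_step_prediction_errors(env, transition_model, policy, n_steps=500):
    """Roll out a policy and compare the learned model's one-step predictions
    against the environment's actual next states.
    """
    errors = []
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)
        s_pred = transition_model(s, a)     # model prediction
        s_next, _, done, _ = env.step(a)    # ground-truth next state
        errors.append(np.linalg.norm(np.asarray(s_next) - np.asarray(s_pred)))
        s = env.reset() if done else s_next
    return np.array(errors)
```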
Results
Results for the Bouncing Balls (Top) and Reacher (Bottom) environments. Center: For each reward model, we report each distance metric (×1000) from GROUND TRUTH (GT); the coverage distribution used to compute the distances is induced by the indicated policy. Right: The average episode return (with standard error) of the policy trained on the corresponding reward model. Values are averaged over 5 executions of all steps (data collection, reward learning, reward evaluation) across different random seeds. Distance values that are inversely correlated with episode return are desirable (i.e., increasing distance within a column should correspond to decreasing episode return). Additionally, for hand-designed models that are equivalent to GT (SHAPING, FEASIBILITY), and for well-fit learned models (REGRESS), lower is better. Learned models are fit to SHAPED and may, as a result, produce higher episode return than those fit to GT.