In reinforcement learning, reward design is often overlooked under the assumption that a well-defined reward is readily available. In practice, however, designing rewards is difficult, and even when a reward function is specified, evaluating its correctness is equally problematic. These challenges become more pronounced in real-world RL applications, where reward design is typically a collaborative process between an RL practitioner and a domain expert. The domain expert expresses preferences, constraints, or desired outcomes, leaving the RL practitioner responsible for designing a reward function that satisfies these preferences.
Therefore, in this work, we develop a reward alignment metric, the Trajectory Alignment Coefficient, to evaluate how well a (reward function, discount factor) pair encodes the preferences of a domain expert. The Trajectory Alignment Coefficient quantifies the similarity between a human stakeholder’s ranking of trajectory distributions and the ranking induced by a given (reward function, discount factor) pair. The figure below demonstrates how this metric can aid RL practitioners in reward design.
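The exact formulation of the coefficient is not given in this section, but the Python sketch below illustrates one way such an agreement score could be computed, assuming the stakeholder's ranking is available as pairwise preferences over individual trajectories and that agreement is measured in a Kendall-tau-style fashion over the pairs the reward function strictly orders. All function names and data structures here are hypothetical illustrations, not the method's actual implementation.

```python
def discounted_return(rewards, gamma):
    """Discounted return of a single trajectory's reward sequence."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


def trajectory_alignment_sketch(trajectories, human_prefs, reward_fn, gamma, tol=1e-9):
    """Illustrative alignment score (hypothetical implementation).

    trajectories: list of trajectories, each a list of (state, action) pairs
    human_prefs:  dict mapping an index pair (i, j) to the index the human prefers
    reward_fn:    function (state, action) -> float
    gamma:        discount factor
    Returns a score in [-1, 1]: +1 means the (reward_fn, gamma) pair orders every
    compared trajectory pair the same way as the human, -1 means it reverses them all.
    """
    # Score each trajectory with the candidate (reward function, discount factor) pair.
    returns = [
        discounted_return([reward_fn(s, a) for s, a in traj], gamma)
        for traj in trajectories
    ]

    agree, total = 0, 0
    for (i, j), preferred in human_prefs.items():
        # Skip pairs the induced returns leave effectively tied (no induced preference).
        if abs(returns[i] - returns[j]) < tol:
            continue
        induced = i if returns[i] > returns[j] else j
        agree += int(induced == preferred)
        total += 1

    # Map the fraction of agreements from [0, 1] to [-1, 1], analogous to Kendall's tau.
    return 2 * agree / total - 1 if total > 0 else 0.0
```

Under this sketch, a practitioner could iterate on candidate reward functions and discount factors and keep the pair whose score over the elicited preferences is highest; restricting the comparison to pairs the reward strictly orders is one possible way to handle ties, and other conventions are equally plausible.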