Can Differentiable Decision Trees Enable Interpretable Reward Learning from Human Feedback?
Reinforcement Learning Conference (RLC) 2024
Motivation
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for capturing human intent, alleviating the challenge of hand-crafting reward functions. However, RLHF typically relies on expressive but opaque reward models, so checking whether a learned reward is aligned with human intent requires running the full RL process and inspecting the resulting policy.
In the context of reward learning, it is especially critical that we can interpret the learned objective: if we cannot understand the objective that a robot or AI system has learned, it is difficult to know whether the system's behavior will be aligned with human preferences and intent.
Thus, we are faced with a problem: we want highly accurate and expressive reward models, but we also want to be able to interpret the learned reward function. In particular, we seek to integrate structural and interpretability constraints into the RLHF pipeline to improve diagnostic capabilities for misalignment issues.
We propose a novel reward learning framework that employs end-to-end Differentiable Decision Trees (DDTs) to learn expressive and interpretable reward functions from pairwise trajectory preference labels, without requiring any hand-crafting of the input feature space. To the best of our knowledge, our framework is the first interpretable tree-based method for reward learning that can be applied in visual domains.
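To make the training setup concrete, below is a minimal sketch, in PyTorch, of a reward DDT trained from pairwise trajectory preferences: sigmoid gates softly route each state through the tree to leaves that hold scalar rewards, and predicted trajectory returns are compared with a Bradley-Terry (cross-entropy) preference loss. The class and parameter names (RewardDDT, depth, beta) and the exact routing scheme are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: a soft (differentiable) decision tree reward model trained
# on pairwise trajectory preferences. Names and hyperparameters are
# illustrative assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardDDT(nn.Module):
    def __init__(self, state_dim: int, depth: int = 3, beta: float = 1.0):
        super().__init__()
        self.depth = depth
        self.beta = beta                        # gate steepness
        n_internal = 2 ** depth - 1             # number of decision nodes
        n_leaves = 2 ** depth                   # leaves hold scalar rewards
        # Each internal node is a soft linear split: sigmoid(beta * (w.x + b)).
        self.gates = nn.Linear(state_dim, n_internal)
        self.leaf_rewards = nn.Parameter(torch.zeros(n_leaves))

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        """Per-state reward = expected leaf value under soft routing."""
        p_right = torch.sigmoid(self.beta * self.gates(states))  # (B, n_internal)
        path = torch.ones(states.shape[0], 1, device=states.device)
        idx = 0  # first node index of the current level (breadth-first order)
        for level in range(self.depth):
            n = 2 ** level
            g = p_right[:, idx: idx + n]        # this level's gate activations
            # Split every current path probability into left/right children.
            path = torch.stack([path * (1 - g), path * g], dim=-1)
            path = path.reshape(states.shape[0], 2 * n)
            idx += n
        return path @ self.leaf_rewards         # (B,) rewards

def preference_loss(model, traj_a, traj_b, pref: int):
    """Bradley-Terry loss on trajectory returns; pref=1 means A is preferred."""
    returns = torch.stack([model(traj_a).sum(), model(traj_b).sum()])
    return F.cross_entropy(returns.unsqueeze(0), torch.tensor([1 - pref]))

# Example: one gradient step on a random preference pair.
model = RewardDDT(state_dim=8, depth=2)
loss = preference_loss(model, torch.randn(50, 8), torch.randn(50, 8), pref=1)
loss.backward()
```

The soft gates keep every path differentiable, so the whole tree can be trained end-to-end by gradient descent; after training, the gates can be hard-thresholded to read the tree as a crisp set of human-interpretable rules.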
We propose hybrid explanations for internal nodes that approximate global explanations by aggregating individual input states.
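This short paragraph does not pin down the aggregation, so the sketch below shows one plausible instantiation under our own assumptions: each internal node's explanation is a weighted average of dataset states, weighted by how strongly each state activates that node's gate. It reuses the hypothetical RewardDDT class sketched above.

```python
# Hedged sketch: per-node "hybrid" explanations as gate-weighted averages of
# dataset states. The weighting scheme is our assumption, not necessarily the
# paper's exact aggregation.
import torch

@torch.no_grad()
def node_prototypes(model, states: torch.Tensor) -> torch.Tensor:
    """One aggregated explanation state per internal node.

    states: (N, state_dim) batch of inputs (e.g. flattened image states,
    which can be reshaped back to images for visualization).
    """
    acts = torch.sigmoid(model.beta * model.gates(states))  # (N, n_internal)
    weights = acts / acts.sum(dim=0, keepdim=True)          # normalize per node
    return weights.T @ states                    # (n_internal, state_dim)
```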
We demonstrate that:
- reward DDTs can often achieve RL performance competitive with higher-capacity deep neural network reward functions;
- reward DDTs are practical as an alignment-debugging tool for inspecting whether a learned reward function is aligned with human intent;
- reward DDTs can reveal cases of silent misalignment, and, importantly, their interpretability exposes the silent misalignment without needing to run RL (see the inspection sketch below).
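As a hypothetical illustration of such RL-free inspection, the helper below (our own sketch, built on the RewardDDT class above) reads a trained tree directly, printing the most influential input features at each decision node and the scalar reward at each leaf.

```python
# Hedged sketch: inspect a trained reward DDT without running RL by printing
# each node's top split features and the leaf rewards. Helper name is ours.
import torch

@torch.no_grad()
def print_tree(model, k: int = 3):
    w, b = model.gates.weight, model.gates.bias  # (n_internal, state_dim)
    for node in range(w.shape[0]):
        top = torch.topk(w[node].abs(), k=min(k, w.shape[1])).indices.tolist()
        print(f"node {node}: splits mainly on features {top} "
              f"(bias {b[node].item():.2f})")
    print("leaf rewards:", [round(r, 2) for r in model.leaf_rewards.tolist()])
```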