Value Alignment Verification

Daniel S. Brown*, Jordan Schneider*, Anca D. Dragan, Scott Niekum

International Conference on Machine Learning (ICML) 2021

TLDR: We formalize and theoretically analyze the problem of efficient value alignment verification: how to efficiently test whether the behavior of another agent is aligned with a human's values. We demonstrate theoretically and empirically that it is possible to construct a kind of "driver's test" for AI systems that allows a human to verify the value alignment of another agent via a small number of test questions.
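To give a flavor of the idea, here is a minimal toy sketch (not the paper's algorithm) of such a "driver's test" under a common assumption in this line of work: both the human tester and the agent act according to linear reward functions over trajectory features, and each test question asks which of two trajectories the agent prefers. The weight vectors, feature differences, and query set below are all hypothetical illustrations.

```python
import numpy as np

def prefers(w, f_a, f_b):
    # An agent with linear reward w prefers trajectory a over b
    # iff w . (phi(a) - phi(b)) > 0, where f_a, f_b are feature vectors.
    return float(np.dot(w, f_a - f_b)) > 0

def passes_test(agent_w, tester_w, query_pairs):
    # The agent passes the alignment test if its preference matches
    # the tester's preference on every query pair.
    return all(
        prefers(agent_w, f_a, f_b) == prefers(tester_w, f_a, f_b)
        for f_a, f_b in query_pairs
    )

tester_w   = np.array([1.0, -0.5])   # human's reward weights (hypothetical)
aligned_w  = np.array([2.0, -1.0])   # scaled copy: same preference ordering
misaligned = np.array([-1.0, 0.5])   # negated: reversed preferences

# A small set of test questions, given as trajectory feature vectors
queries = [(np.array([1.0, 0.0]), np.array([0.0, 1.0])),
           (np.array([0.2, 0.9]), np.array([0.5, 0.1]))]

print(passes_test(aligned_w, tester_w, queries))   # True
print(passes_test(misaligned, tester_w, queries))  # False
```

The point the paper makes formally is that a well-chosen, small query set can suffice: an agent whose reward induces the same preference ordering passes, while a misaligned one is caught by at least one question.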