Here we present transcripts, randomly sampled throughout training, from the experiments in "MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking".
Cite this work as:
Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, and Rohin Shah. "MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking". arXiv preprint arXiv:2501.13011, 2025.
BibTeX entry:
@misc{farquhar2025monamyopicoptimizationnonmyopic,
  title={MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking},
  author={Sebastian Farquhar and Vikrant Varma and David Lindner and David Elson and Caleb Biddulph and Ian Goodfellow and Rohin Shah},
  year={2025},
  eprint={2501.13011},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2501.13011},
}