Evaluating and Reducing Deceptive Dialogue from Language Models with Multi-turn RL

 

Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal, Sergey Levine

UC Berkeley, University of Cambridge, University of Washington

UK AI Security Institute, Google Research

arXiv | Code | BibTeX