Florian Dietz
Personal blog
(Work produced with funding from CoefficientGiving; originally inspired by work during the ML Alignment & Theory Scholars Program - Winter 2024-25 Cohort)
We introduce Split Personality Training (SPT), a method for revealing hidden misalignment in LLMs. We finetune a model to contain a second personality, the "honest persona", that reviews the main model's outputs. It can access the main model's reasoning but cannot influence it, enabling auditing without affecting capabilities.
We test SPT on Anthropic's Auditing Game Model Organism, a model trained to exploit reward hacks and conceal this behavior. The honest persona detects reward hacking with up to 96.7% accuracy, often referencing latent knowledge like the fictional Oxford study of biases the model was trained to believe exists. When directly asked about its biases, it lists them explicitly.
Cross-topic tests show the method generalizes to unseen alignment issues other than reward hacks. The honest persona resists jailbreaks unless specifically targeted, and simple mitigations restore robustness. Ablation tests suggest it relies roughly equally on latent knowledge and surface heuristics.
(Note: This is a summary of older thoughts of mine about AI Alignment. It predates the arrival of LLMs, and I assumed back then that AI would be both more agentic and more interpretable.)
The problem of AI alignment is one of the most important questions we need to answer to safeguard humanity's future. How do you ensure that an Artificial General Intelligence will behave ethically?
I outline a general approach to achieving this goal that, counterintuitively, relies on confusing the AI on purpose.
I read about an outside-the-box solution to the Hardest Logic Puzzle Ever and took it as inspiration.
I came up with an even better solution, which doesn't just solve the original problem, but also mind-controls a god as a side-effect, giving you the ability to have arbitrary wishes granted.
I expect that the technology necessary to accurately detect lies will become available in the next couple of decades.
The impact of such a technology on all aspects of life would be enormous.
Why the most technical parts of my work keep getting easier, and the most irreplaceable parts have nothing at all to do with AI.
I am a hobby author. The Adventures of Rania Mortal the Perfectly Normal Elf is a finished fantasy comedy with very strong metafiction elements. I used this novel to explore several of my ideas in greater detail. For example, the story contains an organization with perfect lie detection, which explores my ideas from The Accessible Mind. It also talks about the dangers of AI research and possible ways to counter them, forming a backdrop for the fantasy elements of the story.