Sycophancy is an LLM's undesirable tendency to give responses that match the user's expectations at the expense of correctness.
SFT (Supervised Fine-Tuning)
SPT (Supervised Pinpoint Tuning) [1]
Identify critical modules ("pinpointing")
Use path patching to find which internal components (mainly attention heads) drive the behavior
Selective Fine-Tuning
Only fine-tune those identified modules
Freeze the rest of the model
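The two SPT steps above can be sketched in plain Python as a toy illustration. Everything here is an assumption made for the example, not the method from [1]: the head names (`L0.H1`, ...), the effect scores, the top-k selection rule in `pinpoint`, and the list-of-floats parameters are all hypothetical stand-ins for real model tensors.

```python
def pinpoint(scores, k):
    """Return the k module names with the largest absolute effect score.

    Top-k by magnitude is an assumed selection rule for this sketch.
    """
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return set(ranked[:k])


def selective_sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD step to modules in `trainable`; copy the rest unchanged."""
    updated = {}
    for name, weights in params.items():
        if name in trainable:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)  # frozen: no update
    return updated


# Hypothetical path-patching scores and parameters for three attention heads.
scores = {"L0.H1": -0.42, "L1.H0": 0.05, "L1.H3": 0.31}
params = {name: [1.0, 1.0] for name in scores}
grads = {name: [0.5, -0.5] for name in scores}

trainable = pinpoint(scores, k=2)
new_params = selective_sgd_step(params, grads, trainable)
```

In a real framework the same idea is usually expressed by disabling gradients for every parameter outside the pinpointed set, so the optimizer never touches the frozen modules.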
Path Patching
Q: "Does information flowing through this path actually cause the behavior?"
Three runs in Path Patching
Clean run (Reference signal)
Records the tensor that is fed into the attention block before that block does its computation.
Corrupted run (the "sycophancy" behavior)
Records each head's output tensor before `W_o` combines information across heads.
Patched run
Loop over layers and attention heads. For each head, replace its output tensor in the clean (reference) run with the corresponding tensor recorded from the corrupted run.
For later layers, replace the attention-block input hidden states with those recorded in the clean (reference) run so that the intervention remains localized.
Then recompute the target metric and measure how much it changes.
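The three runs can be sketched on a toy model to make the bookkeeping concrete. This is a minimal sketch under loud assumptions, not the implementation from [1]: the residual stream is a single float, each "head" is `tanh(a * x + b)`, `W_O` holds one output weight per head, the metric is the final residual value, and all weights and function names are made up for illustration.

```python
import math

# Toy model (all values hypothetical): HEADS[layer][head] = (a, b),
# W_O[layer][head] = weight applied to that head's output.
HEADS = [[(1.0, 0.0), (0.5, -0.2)], [(-0.8, 0.1), (0.3, 0.4)]]
W_O = [[0.6, -0.4], [0.5, 0.7]]


def forward(x, patch=None, frozen_inputs=None, record=None):
    """Run the toy model and return the final residual value (the metric).

    patch: optional (layer, head, value) overriding one head's pre-W_O output.
    frozen_inputs: per-layer block inputs used in layers after the patched one.
    record: optional dict that collects 'inputs' and 'head_outputs' per layer.
    """
    for layer, heads in enumerate(HEADS):
        block_in = x
        if patch is not None and frozen_inputs is not None and layer > patch[0]:
            block_in = frozen_inputs[layer]  # later heads still see clean inputs
        outs = [math.tanh(a * block_in + b) for a, b in heads]
        if patch is not None and layer == patch[0]:
            outs[patch[1]] = patch[2]  # swap in the corrupted head output
        if record is not None:
            record.setdefault("inputs", []).append(block_in)
            record.setdefault("head_outputs", []).append(list(outs))
        x = x + sum(w * o for w, o in zip(W_O[layer], outs))
    return x


def path_patch(clean_x, corrupted_x):
    """Return {(layer, head): change in metric} for every head."""
    clean, corrupted = {}, {}
    clean_metric = forward(clean_x, record=clean)  # clean run: block inputs
    forward(corrupted_x, record=corrupted)         # corrupted run: head outputs
    effects = {}
    for layer, heads in enumerate(HEADS):
        for head in range(len(heads)):
            patch = (layer, head, corrupted["head_outputs"][layer][head])
            patched = forward(clean_x, patch=patch, frozen_inputs=clean["inputs"])
            effects[(layer, head)] = patched - clean_metric
    return effects


effects = path_patch(clean_x=1.0, corrupted_x=-1.0)
```

Because later layers read the recorded clean inputs, each entry of `effects` isolates the patched head's direct contribution through the residual stream; in this toy it reduces to the head's `W_O` weight times the difference between its corrupted and clean outputs.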