Background and Related Work

SYCOPHANCY

Sycophancy is an undesirable LLM behavior in which the model tailors its response to match the user's expectations at the expense of correctness.

SFT (Supervised Fine-Tuning)

SPT (Supervised Pinpoint Tuning) [1]

Identify critical modules ("pinpointing")
- Use path patching to find which internal components (mainly attention heads) drive the behavior
Selective Fine-Tuning
- Only fine-tune those identified modules
- Freeze the rest of the model
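
The two steps above can be illustrated with a minimal NumPy sketch. This is not the implementation from [1]: the additive "one linear map per head" model, the names `weights` and `pinpointed`, the random data, and the learning rate are all hypothetical and chosen only to show that gradient updates touch the pinpointed modules while everything else stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
N_HEADS, D_IN, D_OUT = 4, 6, 3

# Hypothetical toy "model": the output is the sum of one linear map per head.
weights = {f"head_{i}": rng.normal(size=(D_IN, D_OUT)) * 0.1 for i in range(N_HEADS)}

# Assumed result of the pinpointing step (in practice found by path patching).
pinpointed = {"head_1", "head_3"}

# Random stand-in data; a real setup would use the fine-tuning dataset.
x = rng.normal(size=(32, D_IN))
y = rng.normal(size=(32, D_OUT))

def predict(w, inputs):
    """Sum the contribution of every head."""
    return sum(inputs @ w_h for w_h in w.values())

def mse(w):
    return float(np.mean((predict(w, x) - y) ** 2))

before = {name: w.copy() for name, w in weights.items()}
loss_before = mse(weights)

LR = 0.05
for _ in range(200):
    err = predict(weights, x) - y
    # Gradient of the mean squared error; identical for every head in this additive toy.
    grad = 2.0 * x.T @ err / err.size
    # Only the pinpointed heads receive an update; the rest stay frozen.
    for name in pinpointed:
        weights[name] -= LR * grad

loss_after = mse(weights)
```

In a framework such as PyTorch, the same effect is typically obtained by setting `requires_grad = False` on the frozen parameters and passing only the remaining parameters to the optimizer.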

Path Patching

Q: "Does information flowing through this path actually cause the behavior?"

Three runs in Path Patching

Clean run (the reference signal)
- Records the hidden-state tensor that is fed into each attention block, before the block does its computation.
Corrupted run (the "sycophancy" behavior)
- Records each head's output tensor, before `W_o` combines information across heads.
Patched run
- Loop over layers and attention heads. For each head, rerun the model on the reference (clean) input and replace that head's output tensor with the corresponding tensor recorded from the corrupted run.
- For later layers, replace the attention-block input hidden states with those recorded from the reference run, so that the intervention remains localized.
- Then recompute the target metric and measure how much it changes.
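
The three runs above can be sketched on a toy attention-only model. This is a simplified illustration under stated assumptions, not the procedure's reference implementation: the model has no MLPs, layer norm, or causal mask, the weights and inputs are random, and the names `forward`, `params`, `w_readout`, and the scalar readout metric are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, N_HEADS, D_MODEL, D_HEAD, SEQ = 2, 2, 8, 4, 5

# Hypothetical toy weights: per-head Q/K/V projections and one W_o per layer.
params = []
for _ in range(N_LAYERS):
    params.append({
        "W_q": rng.normal(size=(N_HEADS, D_MODEL, D_HEAD)) * 0.3,
        "W_k": rng.normal(size=(N_HEADS, D_MODEL, D_HEAD)) * 0.3,
        "W_v": rng.normal(size=(N_HEADS, D_MODEL, D_HEAD)) * 0.3,
        "W_o": rng.normal(size=(N_HEADS * D_HEAD, D_MODEL)) * 0.3,
    })
w_readout = rng.normal(size=D_MODEL)  # stand-in for the target metric

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head_outputs(p, h):
    """Per-head outputs of shape (N_HEADS, SEQ, D_HEAD), before W_o mixes heads."""
    q = np.einsum("sd,hde->hse", h, p["W_q"])
    k = np.einsum("sd,hde->hse", h, p["W_k"])
    v = np.einsum("sd,hde->hse", h, p["W_v"])
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(D_HEAD))
    return attn @ v

def forward(x, patch=None, corrupt_heads=None, clean_inputs=None):
    """Return (metric, attention-block inputs, per-head outputs) for one run."""
    h = x
    block_inputs, heads_log = [], []
    for layer, p in enumerate(params):
        attn_in = h
        if patch is not None and layer > patch[0]:
            # Later layers read the recorded clean inputs, keeping the patch localized.
            attn_in = clean_inputs[layer]
        block_inputs.append(attn_in)
        z = head_outputs(p, attn_in)
        if patch is not None and layer == patch[0]:
            z = z.copy()
            z[patch[1]] = corrupt_heads[layer][patch[1]]  # swap in the corrupted head output
        heads_log.append(z)
        concat = z.transpose(1, 0, 2).reshape(SEQ, N_HEADS * D_HEAD)
        h = h + concat @ p["W_o"]  # residual connection
    metric = float(h[-1] @ w_readout)
    return metric, block_inputs, heads_log

x_clean = rng.normal(size=(SEQ, D_MODEL))
x_corrupt = rng.normal(size=(SEQ, D_MODEL))

# Clean run: record attention-block inputs. Corrupted run: record per-head outputs.
clean_metric, clean_inputs, _ = forward(x_clean)
_, _, corrupt_heads = forward(x_corrupt)

# Patched runs: one per (layer, head); store the change in the metric.
effects = np.zeros((N_LAYERS, N_HEADS))
for layer in range(N_LAYERS):
    for head in range(N_HEADS):
        patched_metric, _, _ = forward(
            x_clean, patch=(layer, head),
            corrupt_heads=corrupt_heads, clean_inputs=clean_inputs,
        )
        effects[layer, head] = patched_metric - clean_metric
```

Heads with a large absolute entry in `effects` are the candidates for pinpointing, since swapping in their corrupted output moves the metric the most.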
