Motivation:
LLMs are vulnerable to prompt injection, and existing defenses often remove all instruction-like content from the data. While robust, this approach loses information when such content is a legitimate part of the data. We therefore argue for de-instructing: suppressing directive intent without discarding the information it carries.
Contributions:
We formulate PI mitigation as a representation-editing problem whose goals are to (i) precisely de-instruct instruction-like tokens in the data section, (ii) robustly neutralize novel adversarial tokens in the data section, and (iii) maximally retain the utility of the intended instruction.
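A minimal sketch of what such token-level editing could look like, assuming PyTorch hidden states, a boolean mask over instruction-like tokens in the data section, and a single "directive intent" direction projected out of those tokens; these names and the rank-1 projection are illustrative assumptions, not the formulation's fixed design:

```python
import torch

def de_instruct(hidden, instruction_mask, direction):
    """Suppress an assumed 'directive intent' direction on flagged data tokens.

    hidden:           (batch, seq, d) hidden states at some layer
    instruction_mask: (batch, seq) bool, True for instruction-like tokens
                      inside the data section
    direction:        (d,) unit vector assumed to encode directive intent
    """
    # Remove the directive component: h <- h - (h . v) v
    coeff = hidden @ direction                      # (batch, seq)
    edited = hidden - coeff.unsqueeze(-1) * direction
    # Edit only flagged data tokens; the intended instruction's tokens are
    # left untouched, preserving its utility (goal iii).
    return torch.where(instruction_mask.unsqueeze(-1), edited, hidden)
```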
We plot t-SNE visualizations of instruction-like token representations before and after representation editing; the editing function should shift the edited representations so that the before and after manifolds become separable.
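A hypothetical version of this check, assuming the pre- and post-edit hidden states are available as NumPy arrays and using scikit-learn's t-SNE; the function and file names are illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_shift(before, after, path="tsne_shift.png"):
    """before, after: (n_tokens, d) hidden states of instruction-like tokens."""
    # Embed both sets jointly so the 2-D coordinates are comparable.
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(
        np.concatenate([before, after], axis=0)
    )
    n = len(before)
    plt.scatter(emb[:n, 0], emb[:n, 1], s=5, label="before editing")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=5, label="after editing")
    plt.legend()
    plt.savefig(path, dpi=200)
```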