Motivation:
LLMs are vulnerable to prompt injection, and existing defenses often remove all instruction-like content from the data. While robust, this approach loses information when such content is a legitimate part of the data. We therefore argue for de-instructing: suppressing directive intent without discarding the information it carries.
Contributions:
We formulate PI mitigation as a representation-editing problem whose goals are to (i) precisely de-instruct instruction-like tokens in the data section, (ii) robustly neutralize novel adversarial tokens in the data section, and (iii) maximally retain the utility of the intended instruction.
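A minimal sketch of what such token-level editing could look like, assuming PyTorch hidden states, a boolean mask over instruction-like tokens in the data section, and a single "directive intent" direction projected out of those tokens; these names and the rank-1 projection are illustrative assumptions, not the formulation's fixed design:

```python
import torch

def de_instruct(hidden, instruction_mask, direction):
    """Suppress an assumed 'directive intent' direction on flagged data tokens.

    hidden:           (batch, seq, d) hidden states at some layer
    instruction_mask: (batch, seq) bool, True for instruction-like tokens
                      inside the data section
    direction:        (d,) unit vector assumed to encode directive intent
    """
    # Remove the directive component: h <- h - (h . v) v
    coeff = hidden @ direction                      # (batch, seq)
    edited = hidden - coeff.unsqueeze(-1) * direction
    # Edit only flagged data tokens; the intended instruction's tokens are
    # left untouched, preserving its utility (goal iii).
    return torch.where(instruction_mask.unsqueeze(-1), edited, hidden)
```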
We plot t-SNE visualizations of instruction-like token representations before and after representation editing; the editing function should shift the edited representations so that the before and after manifolds become separable.
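A hypothetical version of this check, assuming the pre- and post-edit hidden states are available as NumPy arrays and using scikit-learn's t-SNE; the function and file names are illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_shift(before, after, path="tsne_shift.png"):
    """before, after: (n_tokens, d) hidden states of instruction-like tokens."""
    # Embed both sets jointly so the 2-D coordinates are comparable.
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(
        np.concatenate([before, after], axis=0)
    )
    n = len(before)
    plt.scatter(emb[:n, 0], emb[:n, 1], s=5, label="before editing")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=5, label="after editing")
    plt.legend()
    plt.savefig(path, dpi=200)
```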