Methodology

Workflow

The workflow of D-LLM is shown below:

D-LLM proceeds in two steps. First, we optimize and approximate the SCT matrix ΔW for each layer. Second, we mutate the FFN of each editing layer with its corresponding ΔW to obtain the D-LLM-mutated LLM, which answers harmful questions directly, without any decoration of the original prompt.
