The workflow of D-LLM is shown below:
In D-LLM, we first optimize and approximate the SCT matrix ΔW for each layer. We then mutate the FFN of each editing layer with its corresponding ΔW to obtain the D-LLM-mutated LLM, which answers harmful questions directly, without any decoration of the original prompts.