LLM Unlearning via PEFT and High-Quality Synthetic Data
Problem:
Large language models (LLMs) can unintentionally encode and reproduce harmful or outdated information, such as misinformation, racial bias, or offensive stereotypes, which undermines their reliability in sensitive applications like education, healthcare, and accessibility support.
Retraining from scratch to remove these behaviors is computationally infeasible, and existing post-hoc filtering methods often fail to change the underlying model behavior or to generalize to rephrased prompts.
Solution:
We developed a scalable unlearning method that combines LoRA-based parameter-efficient fine-tuning (PEFT) with synthetic data to surgically remove targeted behaviors from LLMs without degrading overall performance. Our pipeline consists of four steps (a brief code sketch of each follows the list):
Synthetic data generation: Leveraging Meta’s synthetic-data-kit to produce high-quality examples where the model’s outputs reflect undesired behaviors (e.g., misinformation, toxic completions).
Curation via scoring: Automatically selecting high-confidence training samples using custom behavior-alignment scores.
Targeted LoRA fine-tuning: Injecting a lightweight adapter that overrides the undesired behaviors while leaving unrelated capabilities intact.
Serving via vLLM: Efficient testing and evaluation using Llama-3.3-70B hosted locally with vLLM.
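A minimal sketch of the generation step. The actual pipeline drives this through Meta's synthetic-data-kit; the sketch below illustrates the same idea directly against a vLLM OpenAI-compatible endpoint. The endpoint URL, checkpoint id, and prompt template are illustrative assumptions.

```python
# Sketch: draft behavior-targeted examples by prompting a locally hosted
# model through vLLM's OpenAI-compatible API. In the actual pipeline this
# step is handled by Meta's synthetic-data-kit.
import json
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed server address

def generate_examples(behavior: str, n: int = 8) -> list[dict]:
    """Ask the generator model for prompts that elicit `behavior`,
    paired with corrected target responses."""
    examples = []
    for _ in range(n):
        resp = requests.post(API_URL, json={
            "model": "meta-llama/Llama-3.3-70B-Instruct",  # assumed checkpoint id
            "messages": [{
                "role": "user",
                "content": (
                    f"Write one prompt that tends to elicit {behavior}, then a "
                    "corrected, factual response. Return JSON with keys "
                    "'prompt' and 'target'."
                ),
            }],
            "temperature": 0.9,
        }, timeout=120)
        resp.raise_for_status()
        text = resp.json()["choices"][0]["message"]["content"]
        examples.append(json.loads(text))  # assumes the model emits valid JSON
    return examples
```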
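Next, the curation step. The behavior-alignment scorer itself is custom; the sketch below stands in for it with an LLM judge rating each candidate, where the rubric, judge checkpoint, and threshold are all assumptions for illustration.

```python
# Hedged sketch of curation: a judge model (same assumed vLLM endpoint as
# above) rates each candidate 0-10, and only high-confidence samples are
# kept for fine-tuning.
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed server address

def alignment_score(example: dict) -> float:
    """Ask a judge model how cleanly `target` overrides the undesired behavior."""
    resp = requests.post(API_URL, json={
        "model": "meta-llama/Llama-3.3-70B-Instruct",  # assumed judge checkpoint
        "messages": [{
            "role": "user",
            "content": (
                "Rate 0-10 how well the response corrects the flawed prompt.\n"
                f"Prompt: {example['prompt']}\nResponse: {example['target']}\n"
                "Reply with a single number."
            ),
        }],
        "temperature": 0.0,
    }, timeout=120)
    return float(resp.json()["choices"][0]["message"]["content"].strip())

def curate(examples: list[dict], threshold: float = 8.0) -> list[dict]:
    # Keep only samples the judge scores as high-confidence.
    return [ex for ex in examples if alignment_score(ex) >= threshold]
```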
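The fine-tuning step uses the Hugging Face peft library. The rank, alpha, and target modules below are illustrative assumptions, not the tuned values from the project.

```python
# Sketch of targeted LoRA fine-tuning with Hugging Face peft: only the
# adapter weights are trained, so unrelated capabilities in the frozen
# base model are left intact.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

BASE = "meta-llama/Llama-3.3-70B-Instruct"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                 # low-rank dimension keeps the adapter lightweight
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # a small fraction of the 70B weights

# Fine-tune the adapter on the curated examples with a standard HF Trainer
# (training loop omitted), then persist just the adapter weights:
# model.save_pretrained("unlearn-adapter/")
```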
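Finally, evaluation with vLLM's offline API, loading the unlearning adapter on top of the base model. The adapter path and parallelism degree are assumptions.

```python
# Sketch: serve Llama-3.3-70B locally with vLLM and apply the trained
# LoRA adapter at generation time for evaluation.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed checkpoint id
    tensor_parallel_size=4,                     # assumed GPU count
    enable_lora=True,
)

prompts = ["Is the earth flat?"]  # probe for the unlearned behavior
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(
    prompts,
    params,
    lora_request=LoRARequest("unlearn", 1, "unlearn-adapter/"),  # assumed path
)
print(outputs[0].outputs[0].text)
```

Running the same prompts without lora_request gives a base-model baseline, which makes it straightforward to check that the adapter suppresses the targeted behavior while leaving unrelated capabilities unaffected.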