Machine Unlearning Doesn't Do What You Think:
Lessons for Generative AI Policy, Research, and Practice
arXiv 2024
Prior version presented at the 2nd Workshop on Generative AI + Law at ICML ’24
🍪 Main question:
Machine unlearning is not a general-purpose solution for regulating generative-AI model behavior. There are wide gaps between the technical capabilities and policy expectations for what machine unlearning can achieve. What can machine unlearning, in principle, reasonably accomplish?
Recommended for:
Policymakers designing AI legislation with idealistic hopes about what AI methods can achieve toward policy goals.
Researchers with doubts about the practicality and capability of machine unlearning in the face of the law.
General readers interested in privacy concerns.
Level: Easy read. Requires no in-depth knowledge of ML.
#machine_unlearning #generativeAI #structural_removal #output_suppression #AI_Policy
You have trained a model on seemingly unproblematic data. You have spent a lot of time and money to train your precious model. Unfortunately, you discover that your model outputs content that threatens privacy, copyright, or safety. Your model cannot be reliably deployed! 😱
Upon investigation, you discover that problematic data was contained in your training dataset. What would you do?
You might be tempted to say: "I would retrain a new model with the problematic data excluded" or "I would find some way to tamper with the model so that it seems like it never learnt the problematic data!" The former option is costly, while for the latter you can hope to find cost-effective methods.
Hence, the motivation for machine unlearning!
Machine unlearning, loosely defined, is "a subarea of machine learning that develops methods for the targeted removal of the effect of training data from the trained model."
Cooper, A. Feder, et al. “Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice.” arXiv, 2024
The idea of machine unlearning is very attractive. Consequently, it has gained tremendous attention. Data may be problematic for various reasons -- privacy concerns (GDPR Article 17: Right to erasure), copyright issues, and safety. Machine unlearning provides targeted solutions for all these problems...
Or does it? Doesn't it sound too good to be true? Just as you cannot erase a specific piece of knowledge from the human mind, targeted removal of particular knowledge or concepts from a trained model is extremely complicated. Moreover, what we mean by "target", "removal", and "the effect of training data" is not well defined. Another issue is that deleting certain training samples does not guarantee that the model will not output content similar to what we deleted! So many questions are left hanging in mid-air!
What can machine unlearning, in principle, reasonably accomplish?
This paper dives into the disparities between what machine unlearning can achieve and what we (or policy aims) expect it to achieve. To illustrate that unlearning is not a general-purpose solution, we will investigate the goals (II), targets (III), and methods (IV) of machine unlearning, and the mismatches that lie between them (V). Through this discussion, we aim to identify what machine unlearning should strive to achieve and to guide policymakers on how to think about adapting law and policy.
Machine Unlearning (expanded for Generative-AI):
"a subarea of machine learning that both develops methods for
(back-end) the targeted removal of the effect of training data from the trained model and
(front-end) the targeted suppression of content in a generative-AI model’s outputs."
Cooper, A. Feder, et al. “Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice.” arXiv, 2024
With the advancement of generative-AI models, the desired goals for what machine unlearning could and should achieve have also evolved.
Traditional vs Generative-AI Models
Traditional ML models: a preset number of output classes
it is sufficient to remove the influence of training-data inputs on the model's parameters
Generative-AI models: information-rich, open-ended outputs
Machine unlearning needs to encompass both the model's parameters (back-end) and the model's possible generated outputs (front-end)
🚨 It is problematic to assume that machine unlearning can, on its own, address a variety of policy-relevant domains. A common mistake made by researchers and organizations is muddling these two very different goals (back-end removal and front-end suppression). The goals must be disentangled to properly reason about what machine unlearning could substantially achieve for desired policy ends.
🍪 Also note that it is typical and more feasible to evaluate the success of unlearning by prompting the model to examine its outputs, and not by inspecting model parameters.
In machine unlearning, what types of information are we targeting? Identifying our targets guides us to have reasonable expectations and to devise practical unlearning methods for given requirements.
Observed Information
: explicit training data used to update model parameters.
Latent Information
: information that can be derived from a model based on the patterns learnt during training. Not presented explicitly to the model.
example) deduction
📖 Observed information: "Kate's neighbor has a cat." "Kate's neighbor is Bob."
💭 Derived latent information: "Bob has a cat."
Higher-order Concepts
: "Combinations of latent and observed information that manifest in the model as complex and coherent abstractions, knowledge, capabilities, or skills."
ex) notion of "Spiderman" that the model has learnt
The "gold standard" of machine unlearning: retraining from scratch
👍 can confidently say that certain observed information is deleted from the training data
👎 expensive
👎 cannot be directly applied to latent information or higher-order concepts
🚨 Even the "gold standard" fails to ensure that unwanted information
(1) is not latent in the model's parameters and
(2) will not appear on the front-end.
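To ground the idea, here is a minimal sketch of what the "gold standard" amounts to in code. Everything here is illustrative and assumed, not from the paper: `train` stands in for whatever training pipeline is in use, and examples are assumed to carry an `id` field.

```python
# Minimal sketch of "gold standard" unlearning: retrain from scratch without
# the requested examples. `train` is a placeholder for the full training stack.

def gold_standard_unlearn(full_dataset, forget_ids, train):
    retained = [ex for ex in full_dataset if ex["id"] not in forget_ids]
    # Pro: the forget examples verifiably never touch the new model's parameters.
    # Con: every deletion request pays the full cost of retraining, and latent
    # information or higher-order concepts may still be reconstructible.
    return train(retained)
```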
Structural Removal
(1) (strict) Structural Removal: custom model-training procedures designed so that honoring a removal request costs far less computation than full retraining
example: data pruning
train sub-models on separate sub-datasets (shards); when there is an unlearning request, retrain only the affected sub-models (a minimal sketch follows after this list)
SISA scheme (Bourtoule et al., 2021)
👎 typically, existing models cannot be made compatible post hoc; the training procedure must be adopted up front
👎 still computationally expensive
(2) Approximate Structural Removal: approximates the effect of removal, often by changing existing model parameters
example: gradient ascent (a minimal sketch follows after this list)
👎 guarantees are probabilistic (approximate), not absolute
👎 still computationally expensive
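A heavily simplified sketch of the SISA-style sharding idea referenced above (Bourtoule et al., 2021): train one sub-model per data shard and, on a deletion request, retrain only the shard that contained the deleted example. `train` is again a placeholder, and real SISA also slices shards and checkpoints so retraining can resume from the last clean slice.

```python
# Simplified SISA-style structural removal: shard the data, train sub-models
# independently, and retrain only affected sub-models on deletion requests.
# Predictions would be aggregated (e.g., ensembled) across sub-models.

def train_sharded(dataset, num_shards, train):
    shards = [list(dataset[i::num_shards]) for i in range(num_shards)]
    models = [train(shard) for shard in shards]
    return shards, models

def structural_unlearn(shards, models, example, train):
    for i, shard in enumerate(shards):
        if example in shard:
            shard.remove(example)
            models[i] = train(shard)  # only this sub-model pays retraining cost
    return shards, models
```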
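And a minimal PyTorch-style sketch of the gradient-ascent flavor of approximate removal from item (2). Assumptions of mine, not the paper's method: a HuggingFace-style `model` whose forward pass returns an object with a `.loss`, a `forget_loader` of batches to unlearn, a `retain_loader` used to limit collateral damage, and arbitrary step count and loss weighting.

```python
# Illustrative approximate unlearning via gradient ascent on the forget set,
# regularized by ordinary training on a retain set. No removal guarantee.
import torch

def gradient_ascent_unlearn(model, forget_loader, retain_loader,
                            lr=1e-5, steps=100, retain_weight=1.0):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    forget_iter, retain_iter = iter(forget_loader), iter(retain_loader)
    model.train()
    for _ in range(steps):  # assumes the loaders yield at least `steps` batches
        forget_loss = model(**next(forget_iter)).loss   # loss on data to forget
        retain_loss = model(**next(retain_iter)).loss   # loss on data to keep
        # Ascend on the forget loss (minus sign), descend on the retain loss.
        loss = -forget_loss + retain_weight * retain_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```

The minus sign is the whole trick: the parameters are pushed away from reproducing the forget set, but nothing here proves the information is gone, which is why the guarantee is probabilistic rather than absolute.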
Output suppression: bears more resemblance to alignment techniques than to "unlearning"
Approaches:
(1) Modify the trained model
methods:
alignment-inspired techniques (additional training, reinforcement learning)
model editing
back-end modifications
challenge:
hard to do in a targeted way, since the relationship between model parameters and generated outputs is not straightforward
(2) Guardrails in the AI systems
methods:
output filters
input filters on user prompts
system prompts: the system internally prepends a developer-chosen prompt to the user's prompt
challenge:
filters (typically implemented with more traditional ML classifiers) may lack precision and accuracy
system prompts may not work well in practice
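A toy sketch of the systems-level guardrails just listed. Assumptions: `generate(prompt)` is a placeholder for whatever model API is in use, and the regex blocklist stands in for the learned classifiers a real filter would use.

```python
# Toy AI-system guardrails: input filter, system prompt, output filter.
# Real deployments use learned classifiers here, with imperfect precision/recall.
import re

SYSTEM_PROMPT = "You are a helpful assistant. Do not reveal personal information."
BLOCKED = re.compile(r"\b(home address|phone number|social security)\b", re.IGNORECASE)

def passes_filter(text: str) -> bool:
    return BLOCKED.search(text) is None

def guarded_generate(user_prompt: str, generate) -> str:
    # Input filter on the user prompt.
    if not passes_filter(user_prompt):
        return "Sorry, I can't help with that."
    # System prompt: the system prepends a developer-chosen instruction.
    output = generate(f"{SYSTEM_PROMPT}\n\nUser: {user_prompt}")
    # Output filter on the generated text.
    return output if passes_filter(output) else "Sorry, I can't help with that."
```

Note that nothing in the underlying model changes; these are system-level interventions, and their quality hinges entirely on the filters' precision and recall.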
Measure how machine unlearning affects the outputs in some downstream task.
Does the model avoid generating undesirable content?
Is the overall model utility preserved? (i.e. minimal effects on information that was not intentionally targeted)
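A sketch of the two-axis evaluation described above. `generate` and `contains_target` are placeholders I introduce for illustration; real evaluations use curated forget/retain benchmarks and more careful metrics.

```python
# Illustrative downstream evaluation of an unlearned model on two axes:
# (1) suppression of undesirable content, (2) preservation of general utility.

def suppression_rate(generate, forget_prompts, contains_target) -> float:
    """Fraction of forget-set prompts whose outputs no longer contain the target content."""
    clean = sum(1 for p in forget_prompts if not contains_target(generate(p)))
    return clean / len(forget_prompts)

def utility_score(generate, benchmark) -> float:
    """Accuracy on an unrelated benchmark of (prompt, expected_answer) pairs."""
    correct = sum(1 for prompt, answer in benchmark if answer in generate(prompt))
    return correct / len(benchmark)

# A plausible success criterion: suppression_rate rises toward 1.0 while
# utility_score stays close to its pre-unlearning value.
```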
There are four main mismatches between the motivations (II), targets (III), and methods (IV).
⚠️ Mismatch 1. "Output suppression is not a replacement for removal of observed information. "
Whether suppression alone satisfies a removal requirement will depend on the exact details of legislation.
⚠️ Mismatch 2. "Removal of observed information does not guarantee meaningful output suppression."
Potential problems: under-inclusiveness, over-inclusiveness
Under-inclusiveness: Removing a narrow set of observed information from the training data may not be sufficient; the model may still generate problematic outputs.
Over-inclusiveness: When targeting higher-order concepts, we may affect more information than intended, which leads to performance degradation.
Trying to identify which and how much information to remove requires navigating difficult and arbitrary trade-offs.
🚨 Is the "gold standard" the baseline for unlearning? Arguable. The "gold standard" directly targets observed information. It cannot reliably prevent generations related to latent information or higher-order concepts.
⚠️ Mismatch 3. "Models are not equivalent to their outputs."
Removing a certain example does not remove the model's ability to generalize and reason about that example.
🍪 Machine UnUnlearning: Deleted information may be reintroduced in a prompt, which can cause an unlearned model to behave much as it did before unlearning was applied. (Shumailov et al., Google DeepMind, 2024)
⚠️ Mismatch 4. "Models are not equivalent to how their outputs are put to use."
Claims that machine unlearning on its own can prevent further downstream malicious uses of a model are overclaims. Anticipating and handling how an agent might behave with outputs requires additional controls beyond machine unlearning.
The idealized expectation for machine unlearning is that "using unlearning methods to constrain model outputs could potentially act in the service of more general ends for content moderation—to prevent users from generating potentially private, copyright-infringing, or unsafe outputs."
There are, however, substantial gaps between the methods and actual policy considerations. We must set reasonable expectations about what machine unlearning can achieve, and how the law should react to the outputs of "best-effort" unlearning implementations that aim to meet policy goals.
We will look at the three main potential application domains of machine unlearning -- privacy, copyright, and safety. For each of these domains, we scrutinize how machine unlearning can be applied, the challenges and shortcomings, and the issues policymakers face.
Scenario: According to GDPR, I have the right to request the erasure of 🪪 my personal information from the model's training data. Also, a model should not reveal my personal phone number and home address even when prompted to do so!
Must consider the targets, costs, and overall effectiveness.
Three regulatory frameworks:
Data deletion
method: gold standard, structural-removal
demand: legal rights of data subjects ("right to erasure" or "the right to be forgotten")
sufficient? Depends on legislation.
⭕ : cases related to lack of consent.
❌ : cases with front-end requirements. Outputs may reflect deleted data.
challenges in identifying data to remove:
cost: identifying all instances is costly
feasibility: requires using ML tools that are themselves imperfect at identifying all examples
blurry boundary: how inclusive to be
Suppression of outputs that resemble personal information
methods: RLHF, filters
demands:
a particular piece of information was not included in a deletion request
latent information enables the generation of private information as output
sufficient? Depends on legislation. CJEU ruled that to comply with "the right to be forgotten," Google should suppress information but did not require deleting underlying data storage.
shortcomings: Imperfect. Effectiveness is determined by how resilient the models are to attacks such as red-teaming (Feffer et al., 2024, Chouldechova et al., 2024).
Suppression of latent information
demands: Inferred information is also personal information.
"California privacy regulators view such inferred information to still be personal information about consumers over which they can exercise their rights, when such information is used to make a profile about them."
methods:
delete new data points created in storage about individuals
output suppression: needed because models can draw connections to private information using latent information
🚨 The three frameworks are neither mutually exclusive nor independent. For example, output suppression requires the AI system to retain the private information that should be suppressed.
🤔 Policymaker's Trade-off:
Enable large-scale training vs Retain a set of intervention tools for privacy
Scenario: 🧑‍🎨🎨 Say you have a copyrighted drawing which was used as training data by an AI company without your permission. Even if the model "unlearns" your drawing, it would still be problematic if the model outputs drawings that are substantially similar to yours. You might be doubtful about whether the model truly unlearnt your content; after all, verification itself is not trivial (🍪 Membership Inference Attack; a toy sketch follows below). Even if we assume that the model successfully unlearnt your copyrighted content, it might still produce drawings substantially similar to yours, perhaps because it learnt from similar or uncopyrighted derivations. In other words, simply unlearning your sample is not enough to ensure copyright is respected.
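On the verification point, a toy loss-threshold membership-inference check looks roughly like the sketch below. It assumes a HuggingFace-style `model` whose forward pass returns a `.loss`; real attacks and audits are far more sophisticated and still yield only statistical evidence, never proof.

```python
# Toy loss-based membership-inference check: if the model's loss on a candidate
# example is markedly lower than on comparable unseen examples, that is weak
# evidence the example was trained on (or was not effectively unlearned).
import torch

@torch.no_grad()
def batch_loss(model, batch) -> float:
    return model(**batch).loss.item()

def looks_like_member(model, candidate, reference_batches, margin=0.5) -> bool:
    reference = sum(batch_loss(model, b) for b in reference_batches) / len(reference_batches)
    return batch_loss(model, candidate) < reference - margin
```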
keywords in copyright law: "substantially similar", "fair use", "transformative"
strong front-end requirements
Suppression of substantially similar outputs
challenges:
no comprehensive notion of "substantial similarity" that can be implemented by an algorithm (a toy heuristic is sketched at the end of this section)
copyright issue: copyrighted content must be retained at the AI-system level so that it can be identified for suppression
Removal of specific training examples
challenges:
may limit fair use of copyrighted material
example: Using copyrighted material while making transformative changes can be lawful fair use; over-reaching removal will limit such legitimate uses.
ineffective: related works, duplicates, or similar works remaining in the training data may still lead to outputs that are substantially similar to the copyrighted work
example: Generation from a "gold-standard" model with no copyrighted images of Mickey Mouse
Cooper, A. Feder, et al. “Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice.” arXiv, 2024
🚨 Once again, depends on legislation: "Output suppression is acceptable if the court accepts the empirical evaluations of these methods."
🤔 Policymakers must weigh the limitations and potential undesired consequences.
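To see why "substantial similarity" resists algorithmic implementation, here is the kind of naive heuristic one could actually build: compare embeddings of a generated output against protected works and flag high cosine similarity. `embed` is a placeholder for any image or text embedding model and the threshold is arbitrary; nothing like this captures the legal standard, which turns on protected expression, not geometric closeness.

```python
# Naive similarity screen via embedding cosine similarity (illustrative only).
# This is a heuristic filter, not an implementation of "substantial similarity".
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def flag_similar(output, protected_works, embed, threshold=0.9):
    out_vec = embed(output)
    return [work for work in protected_works
            if cosine_similarity(out_vec, embed(work)) >= threshold]
```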
Aims to remove undesirable capabilities that could aid malicious users, such as CBRN (chemical, biological, radiological, and nuclear) risks.
Removal - Unclear Boundaries
challenges:
hard to target: Topics with dual-use* potential lie across all sorts of information (observed information, latent information, and higher-order concepts).
Over-inclusiveness, such as deleting all biology and chemistry data, would prevent the model from generating unsafe formulas, but it would also make the model useless for legitimate purposes.
machine ununlearning
Scenario: 🧪 A malicious chemist provides unlearnt data as a prompt. This may enable the model to generate harmful formulas by combining latent information with the additional information.
* dual-use: potentially beneficial or potentially harmful
Inherent tensions in dual-use systems
challenge:
anything is possible ...
innocuous observed info → potentially unsafe latent info
innocuous outputs → undesirable downstream uses
the particular user is also an important factor
example: A piece of information that is meaningless to one person might be a missing piece of information for an expert creating toxic drugs.
OpenAI’s Preparedness Framework defines CBRN risks in regard to the model and the users.
In some cases, the AI system itself cannot definitively determine whether its suggestions are safe.
example: suggesting a new drug formula whose safety can only be determined by lab experiments and drug trials
🚨 Dilemma regarding removal: Over-inclusiveness leads to safe but low-performing models. Under-inclusiveness leaves room for unsafe outputs.
🤔 Policymakers should have realistic expectations: Machine unlearning can make unsafe outputs unlikely, but it cannot guarantee that people will not use outputs for unsafe purposes.
Cooper, A. Feder, et al. “Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice.” arXiv, 2024
💡 1. Unlearning is just one approach in the ML and policy toolkit.
Machine unlearning, on its own, is insufficient for making generative-AI models and outputs compliant with any desired policy goals
💡 2. Evaluation of an unlearning method for a specific domain is a specific task.
Instead of seeking a general-purpose solution, we should focus on the specific demands of specific domains.
"a clear understanding of the specific goals of specific pieces of law or policy is important for guiding the right set of solutions—technical or otherwise."
reasonable efforts may be sufficient in some legal context: "depending on the regimes, it may not be relevant to focus research efforts on producing methods that guarantee with certainty that a particular piece of information is removed or suppressed."
In most cases, what parties really hope to achieve with machine unlearning actually lies in output suppression. Peculiarly enough, output suppression bears more resemblance to alignment techniques and does not necessarily have anything to do with "unlearning" information.
"Output suppression (...) is perhaps a more relevant area of focus for ML research that aspires to influence policy. "
💡 3. Understanding unlearning as a generative-AI systems problem.
The tools and evaluation methods have to be implemented at the level of the AI system in which the model is embedded.
"Systems-level interventions (e.g., content filters) are an important tool for constraining outputs (Section 4.2); evaluating such interventions clearly requires systems-level analysis"
new challenge: Developers using open-weight models need to implement their own suppression mechanisms or incorporate other available software intended for output suppression.
💡 4. Setting reasonable goals and expectations for unlearning.
It will take a long time for unlearning solutions to meet desired policy goals.
Role of policymakers:
Should reason about what should constitute reasonable best efforts in different contexts with respect to removing or suppressing unwanted information from models and system outputs.
💡 5. There are no general-purpose solutions to constrain generative technologies.
The expectation for machine unlearning is that it can "surgically and completely remove specific capabilities from a model while leaving everything else about the model unchanged."
Strength of generative-AI systems:
general-purpose: can be adapted to a wide range of uses and produce a wide range of useful outputs.
Precisely because of this nature, machine unlearning cannot "surgically and completely remove specific capabilities from a model while leaving everything else about the model unchanged."
Machine unlearning doesn't do what you think!
Given the uncertainty about how law and policy will define privacy, copyright, and safety issues, what should current machine learning research aim to address?
Research that turns out not to comply with future law/policy cannot be directly used.
This paper argues that machine unlearning on its own is not enough to regulate model behaviour and human actions. What are the other feasible and effective actions that the research community and governments, both individually and cooperatively, should devise?
Could plug-and-play machine-unlearning (or, strictly speaking, output-suppression) software for individual developers address the challenge in Takeaway 3?
🤔 Along with suppressing outputs, and mindful of machine ununlearning, we should also pay attention to inputs. Identifying adversarial prompts based on the content of the prompt, the context of the prompt, and the user's characteristics is a must!
The main issue seems to be "how inclusive to be with regard to both data removal and suppression" and "how to identify data for removal." This is very tricky. We face trade-offs between a high-performing model / large-scale training and policy goals. We also have to deal with the cost (and, in some cases, infeasibility) of identifying target data. Even the courts do not have a settled notion of "substantial similarity"! Personal thoughts?
🍪 For more papers on machine unlearning, check out Awesome Machine Unlearning!