Humanizing Machine-Generated Content:
Evading AI-Text Detection through Adversarial Attack
Ying Zhou, Ben He*, Le Sun*
LREC-COLING 2024
🍪Short Summary:
Proposes the Adversarial Detection Attack on AI-Text (ADAT) task to address the need for detectors resilient to attacks.
Designs the HMGC framework, which generates adversarial texts by applying perturbations to important tokens selected via token gradients and perplexity.
Explores how dynamic, iterative adversarial learning can strengthen detectors against adversarial attacks.
#llm_generated_text_detection #adversarial_attack #adversarial_learning
THE VULNERABILITIES OF CURRENT AI-TEXT DETECTORS TO PARAPHRASING
🤔🧑🎓Let's think of a university student who is not supposed to use ChatGPT to write her essay but wants to. Would she A) submit an essay written entirely by ChatGPT without any modifications, or B) generate an essay with ChatGPT but "add her own touches" to avoid being punished? Generally, the answer would be B. Texts that are tailored (or "humanized") to evade detection in this way are called "adversarial texts."
Detecting AI-text is crucial for addressing the rising ethical issues around LLMs, such as false information, hallucination, and threats to academic integrity. Over the past 2–3 years, there has been active research on detecting machine-generated text, such as DetectGPT (Mitchell et al., ICML 2023).
However, there is a gap between previous research and real-life scenarios. Previous research rarely takes adversarial texts into account (it mostly assumes case A), whereas in real life, people try to evade detection by adding perturbations (case B). Existing detection methods lack robustness and are easily compromised by paraphrasing attacks.
💡We need to explore how we can make robust detectors by exploring adversarial attacks in real-world dynamic scenarios!
1. EXISTING AI-TEXT DETECTION METHODS
Statistical: Zero-shot detection using statistics such as entropy, perplexity, etc.
Classifier-based: Neural classifier trained from supervised data. a.k.a. model-based.
Watermarking: Embedding imperceptible patterns into the generated text. Precautionary in nature compared to the other two methods.
🍪 If interested, check out Github Awesome Paper List for more information!
2. ADVERSARIAL ATTACKS
RADAR (Hu et al., NeurIPS 2023): Adversarial training of the detector and the paraphraser.
Hu, X., Chen, P. Y., & Ho, T. Y. (2023). RADAR: Robust AI-Text Detection via Adversarial Learning. Advances in Neural Information Processing Systems, 36, 15077-15095.
⚔️Paraphraser: Generates adversarial content. The quality of the paraphraser improves as it receives feedback from the detector.
🛡️Detector: Aims to detect the authorship of the text. Robustness is fortified by learning from adversarial examples generated by the paraphraser.
Difference with HMGC: HMGC examines whether the detector can continue to learn in dynamic scenarios with multiple rounds of attack.
Task: ADVERSARIAL DETECTION ATTACK ON AI-TEXT (ADAT)
Framework: HUMANIZING MACHINE-GENERATED CONTEXT (HMGC)
Zhou, Y., He, B., & Sun, L. (2024). Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack. LREC-COLING.
To address the need for detection that is robust to adversarial LLM-generated text, this paper is the first to propose a rigorously defined adversarial text detection task, ADAT. The paper characterizes what a successful adversarial attack is and establishes two attack settings for this task.
DEFINITION: Successful Adversarial Attack
We say an adversarial attack is successful if it satisfies all of the following:
the original text is classified as AI-written
the new text is classified as human-written
the original text and the new text must be similar, as measured by a similarity distance function Dis(·, ·).
Formally, for a detector f with a threshold δ above which a text is classified as AI-written, an attack that turns the original text x into x′ is successful iff f(x) > δ, f(x′) ≤ δ, and Dis(x, x′) ≤ ε for some similarity budget ε.
ATTACK SETTINGS: White-Box & Black-Box
White-Box Scenario: The attacker knows internal information about the victim detector such as parameters, gradients, training data, etc.
(Decision-Based) Black-Box Scenario: The attacker only has access to the output of the victim model.
⭐ Closer to real-life situations.
cf. two types of black-box scenarios:
score-based black box: The attacker has access to "how confident" the detector is about its decision (a continuous score).
decision-based black box: The attacker only has access to the binary predictions.
This paper uses HC3 (Guo et al., 2023) for the white-box attack and CheckGPT (Liu et al., 2023) for the black-box attack. HC3 consists of QA data, while CheckGPT comprises news, essay, and research data.
HMGC is a 5-step framework that generates adversarial text given a target text and a victim detector model.
Idea: The attacker keeps "humanizing" the text (Procedures 3–5) until the detector is finally fooled.
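To make the loop concrete, here is a minimal sketch of how the iterative procedure could be wired together. The helper functions (rank_word_importance, mask_substitute, passes_post_checks) are hypothetical placeholders for Procedures 3–5 (sketched further below), and detector is assumed to return P(AI-written); this is not the authors' released code.

```python
def humanize(text: str, detector, max_iters: int = 20, threshold: float = 0.5) -> str:
    """Iteratively perturb `text` until `detector` scores it as human-written.

    `detector(text)` is assumed to return P(AI-written); the helpers are
    hypothetical stand-ins for HMGC Procedures 3-5.
    """
    current = text
    for _ in range(max_iters):
        if detector(current) < threshold:                          # detector is fooled: stop
            break
        ranked = rank_word_importance(current, detector)           # Procedure 3
        candidate = mask_substitute(current, ranked, detector)     # Procedure 4
        if passes_post_checks(text, candidate):                    # Procedure 5
            current = candidate
        else:
            break                                                  # constraints violated: give up
    return current
```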
Procedure 1. Train Surrogate Model
white-box: use the released HC3 detector
black-box: make a proxy detection model by distilling the predictions of the black-box model
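For the black-box case, the proxy is essentially obtained by knowledge distillation: query the victim detector for its decisions and fit a local classifier on them. Below is a minimal sketch of this idea, assuming a hypothetical query_victim(texts) wrapper around the victim's API; the model choice (roberta-base) and training details are illustrative, not the paper's exact setup.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def train_surrogate(texts, query_victim, epochs=3, lr=2e-5, batch_size=16):
    """Distill a black-box detector into a local surrogate classifier.

    `query_victim(list_of_texts) -> list_of_int` is a hypothetical wrapper around
    the victim detector's API (0 = human-written, 1 = AI-written).
    """
    labels = query_victim(texts)                              # collect the victim's decisions
    tok = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch_texts, batch_labels in DataLoader(list(zip(texts, labels)), batch_size=batch_size,
                                                    shuffle=True, collate_fn=lambda b: list(zip(*b))):
            enc = tok(list(batch_texts), padding=True, truncation=True, return_tensors="pt")
            loss = model(**enc, labels=torch.tensor(batch_labels)).loss   # cross-entropy on distilled labels
            loss.backward()
            optim.step()
            optim.zero_grad()
    return tok, model
```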
Procedure 2. Pre-attack Assessment
Check whether the detector classifies the current sample as human-written. In other words, collect data on the sample prior to the attack.
This data is not used in the rest of the framework. Though not explicitly specified, it seems to exist merely to measure the effect of HMGC after the attack.
Procedure 3. Word Importance Ranking
Assign "Importance" to each token.
💡Intuition: As perturbing more important tokens can have significant impact on the text and consequently the detection results, we must identify which tokens are more important.
The word importance is calculated from gradients and perplexity, weighted by a hyperparameter α.
Gradients
💡Tokens with higher gradients have a greater impact on the detection result.
Perplexity
Perplexity importance is measured by the difference in perplexity before and after the i-th token is removed.
💡Previous work as well as this paper (see Section 4.2 Experimental Results) highlights that perplexity is a core indicator that differentiates human and AI-text.
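Here is a rough sketch of how the two signals above could be computed and combined; the exact formulation in the paper may differ, and the surrogate classifier clf, the GPT-2 perplexity scorer, and the weight alpha are assumptions made for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def gradient_importance(text, tok, clf, ai_label=1):
    """Per-token score = norm of the gradient of the AI-class logit w.r.t. the input embeddings."""
    enc = tok(text, return_tensors="pt", truncation=True)
    embeds = clf.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    logit = clf(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits[0, ai_label]
    logit.backward()
    return embeds.grad[0].norm(dim=-1)            # one score per subword token

def perplexity(text, lm_tok, lm):
    enc = lm_tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        loss = lm(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

def perplexity_importance(words, lm_tok, lm):
    """Score of word i = |PPL(text without word i) - PPL(full text)|."""
    base = perplexity(" ".join(words), lm_tok, lm)
    return [abs(perplexity(" ".join(words[:i] + words[i + 1:]), lm_tok, lm) - base)
            for i in range(len(words))]

# Example scorers: lm_tok = GPT2TokenizerFast.from_pretrained("gpt2"),
#                  lm = GPT2LMHeadModel.from_pretrained("gpt2")
# Combined (simplified): importance_i = alpha * grad_i + (1 - alpha) * ppl_i,
# after aligning subword-level gradient scores to the word level (alignment omitted here).
```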
Procedure 4. Mask Word Substitution
Selectively replace up to K of the most important tokens with outputs from an encoder-based model, where K is the maximum number of tokens that may be replaced.
Step 1. Sort in descending order of importance.
Step 2. Replace the top K tokens with [MASK].
Step 3. Use an encoder-based model to generate candidate replacements. Multiple candidates are generated per token.
Step 4. With a greedy search strategy, replace the [MASK] tokens with candidate words.
Note that in Step 4, a token is replaced only when the substitution makes the text look more "human" to the detector.
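A simplified sketch of the mask-and-substitute step, masking one token at a time and keeping a candidate only if it lowers the surrogate's AI score. The MLM (roberta-base), the ai_score(text) scorer, and the greedy first-improvement strategy are illustrative choices, not necessarily the paper's exact implementation.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")    # encoder-based MLM for candidates

def mask_substitute(words, important_idx, ai_score, top_k=10):
    """Greedily replace important words with MLM candidates that make the text look more 'human'.

    `important_idx`: word indices sorted by descending importance (top-K only).
    `ai_score(text) -> float`: hypothetical surrogate scorer, higher = more AI-like.
    """
    words = list(words)
    best = ai_score(" ".join(words))
    for i in important_idx:
        original = words[i]
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        for cand in fill_mask(masked, top_k=top_k):        # candidate substitutions for this slot
            words[i] = cand["token_str"].strip()
            score = ai_score(" ".join(words))
            if score < best:                               # keep only if it looks more human
                best = score
                break
        else:
            words[i] = original                            # no improving candidate: revert
    return " ".join(words)
```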
Procedure 5. Post-attack Checking
Ensure that the humanized text and the original text remain semantically similar by discarding samples that do not meet the checking criteria.
🍪Recall Dis() in ADAT successful adversarial attack definition!
Checking is done according to the following criteria:
POS Constraint: candidate words must have the same POS (part of speech) as the original words.
Maximum Perturbed Ratio Constraint: the proportion of replaced words cannot exceed a certain threshold.
USE (Universal Sentence Encoder) Constraint: the USE embedding of the perturbed text cannot be too far from that of the original text; if it is, the substitution may have tampered with the semantics.
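Put together, the check could look roughly like the function below. The paper uses the Universal Sentence Encoder; a sentence-transformers model is used here as a stand-in, and the thresholds are illustrative.

```python
import nltk
from sentence_transformers import SentenceTransformer, util

nltk.download("averaged_perceptron_tagger", quiet=True)   # newer NLTK: "averaged_perceptron_tagger_eng"
encoder = SentenceTransformer("all-MiniLM-L6-v2")          # stand-in for the Universal Sentence Encoder

def passes_post_checks(original: str, perturbed: str,
                       max_perturb_ratio: float = 0.15, min_sim: float = 0.85) -> bool:
    orig_words, pert_words = original.split(), perturbed.split()
    if len(orig_words) != len(pert_words):                 # word-level substitution keeps length fixed
        return False
    changed = [(o, p) for o, p in zip(orig_words, pert_words) if o != p]
    # Maximum Perturbed Ratio constraint
    if len(changed) / max(len(orig_words), 1) > max_perturb_ratio:
        return False
    # POS constraint (tagging single words out of context is a simplification)
    for o, p in changed:
        if nltk.pos_tag([o])[0][1] != nltk.pos_tag([p])[0][1]:
            return False
    # Semantic similarity constraint (USE in the paper, a MiniLM encoder here)
    emb = encoder.encode([original, perturbed], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= min_sim
```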
METRICS: Attack Performance Measures
AUC-ROC: the area under the receiver operating characteristic curve
A higher AUC-ROC score signals better detection performance.
🍪 For more details, check out this informative Google Developers post!
Confusion Matrix
Positive Predictive Value (PPV): Out of the texts classified as human-written, how many did the detector get right?
True Negative Rate (TNR): Out of all AI-texts, how many did the detector classify correctly?
∆Acc: The decrease in TNR, i.e., the decrease in detection accuracy on machine-generated samples after attack.
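Treating human-written as the positive class, these metrics reduce to simple confusion-matrix arithmetic; a quick sketch with made-up counts for illustration:

```python
def detector_metrics(tp, fp, tn, fn, tnr_before):
    """Human-written = positive class; AI-written = negative class."""
    ppv = tp / (tp + fp)            # of texts predicted human, fraction that are truly human
    tnr = tn / (tn + fp)            # of AI texts, fraction correctly flagged as AI
    delta_acc = tnr_before - tnr    # drop in detection accuracy on AI texts after the attack
    return ppv, tnr, delta_acc

# Made-up example: before the attack TNR was 0.99; after the attack only 5 of 100 AI texts
# are still flagged (tn=5, fp=95) while the 100 human texts are unaffected (tp=100, fn=0).
print(detector_metrics(tp=100, fp=95, tn=5, fn=0, tnr_before=0.99))   # -> approx (0.51, 0.05, 0.94)
```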
BASELINES
Word-level perturbation: replaces words with synonyms
baselines: WordNet, BERT MLM predictions
Sentence-level perturbation: paraphrases or rewrites sentences, generally with a seq2seq model
baselines: introducing irrelevant sentences, employing BART to rewrite random sentences
Full-text rewriting perturbation: utilize a model to rewrite the whole text
baselines: back translation (ENG➡️GER➡️ENG), prompting LLaMA-2 to rewrite, DIPPER (SOTA paraphraser, Krishna et al., NeurIPS 2023)
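As an example, the back-translation baseline can be approximated with off-the-shelf MarianMT models (the paper does not specify which translation models it uses, so the Helsinki-NLP checkpoints below are an assumption):

```python
from transformers import pipeline

en_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def back_translate(text: str) -> str:
    """ENG -> GER -> ENG round trip as a full-text rewriting perturbation."""
    german = en_de(text, max_length=512)[0]["translation_text"]
    return de_en(german, max_length=512)[0]["translation_text"]
```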
💡1. Detectors are vulnerable to adversarial attack.
In both white-box and black-box settings, HMGC shows high attack performance.
Attacks are more successful in white-box setting, as the attacker utilizes precise internal information.
AUC-ROC drops significantly: 99.63% (before attack) ➡️ 51.06% (HMGC)
∆Acc reaches 97.29%, meaning the detector's accuracy on AI-text drops by 97.29 percentage points after the attack.
In black-box setting, HMGC still proves effective.
∆Acc rises from 24.24% for the best baseline (MLM Syn) to 46.35% for HMGC.
💡2. Training method of the detector significantly influences its robustness.
Recall that a higher ∆Acc means the attack was more successful, i.e., the detector was more easily evaded. ∆Acc for attacks on CheckGPT is higher than for attacks on HC3, meaning the CheckGPT model was easier to fool. Let us examine the differences between these two models.
Comparison between the training method of HC3 and CheckGPT models (detectors):
harder-to-fool HC3 Model: full parameter fine-tuning, RoBERTa-based
easier-to-fool CheckGPT model: only trained the top-level LSTM with frozen RoBERTa
💡3. Adversarial learning can effectively enhance detector performance. (dynamic setting)
As the number of rounds of dynamic adversarial learning increases, the detector's robustness increases.
Is this without cost? No: the time required to perform the attacks keeps increasing.
When does the detector become strong enough?
Up to round 3, the detector is relatively vulnerable.
Equilibrium is reached after roughly 7 rounds of attacks, i.e. iterative learning from 50k adversarial examples.
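Conceptually, the dynamic setting is a loop in which each round's successful adversarial examples are fed back into detector training. The sketch below is only schematic: humanize (sketched earlier) and retrain are hypothetical placeholders, not the paper's training code.

```python
def dynamic_adversarial_learning(detector, machine_texts, human_texts, rounds=7):
    """Alternate between attacking the detector and retraining it on the attack's outputs."""
    train_pool = [(t, 1) for t in machine_texts] + [(t, 0) for t in human_texts]
    for _ in range(rounds):
        # Attack phase: humanize machine text against the current detector (HMGC, Procedures 3-5).
        adversarial = [humanize(t, detector) for t in machine_texts]
        # Defense phase: add the adversarial texts (still labeled as AI-written) and retrain.
        train_pool += [(t, 1) for t in adversarial]
        detector = retrain(detector, train_pool)
    return detector
```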
💡Ablation Studies: Word perplexity as a core component for successful attack.
This can be established by (1) the ablation result for the PPL component and (2) the ablation results for the other modules, i.e., the relationship between ∆Acc and ∆PPL.
In the white-box setting, the ablations cause only minor fluctuations.
In the black-box setting, removing model perplexity decreases the attack success rate (most notably ∆Acc) the most.
⏩ shows the effectiveness of the perplexity-based word importance
In the ablation results of the other modules (USE, MPR, POS), the attack success rate and the language perplexity of the humanized text are clearly proportional.
⏩ suggests another attack strategy: increasing text perplexity with external noise
Is HMGC truly close to "real-life dynamic scenarios"? Are the detectors trained from HMGC as robust as the authors claim?
🤔 Arguably, HMGC's black-box setting is privileged in that the attacker knows which victim model will be used. Realistically, there are cases where we would not know. For example, if a student hands in an essay and a professor uses a detector to check whether the essay was human-written, the professor would not disclose what kind of detector he would use.
Furthermore, an even more critical problem lies in the training of the proxy model for the black-box attack. When the authors train a proxy model to distill the behaviour of the victim model, they use data known to have been used as its training data. In a strict sense, this is not a black-box attack. Additional experiments with a proxy model trained on other data are needed to support the claims made in this paper.
Are the comparisons with the baselines fair or sufficient?
🤔 Arguably, no, because HMGC is tailored specifically to HC3 and CheckGPT, while the other baselines are not.
🤔 The metrics used seem insufficient and questionable for 2 reasons.
1. Would the same outcomes be reproduced in settings where the detector is also paraphrasing-based?
This paper only examines the success rate of HMGC text against model-based detectors. Perhaps targeting model-based detectors only (and not watermarking or paraphrasing-based detectors) is an implicit assumption throughout the paper. As HMGC employs an LM to perturb tokens in the text, it might not be successful in attacking paraphrasing-based detectors. There is no discussion of this.
🍪 paraphrasing-based detection: BARTScore-CNN (EMNLP 2023), RAIDAR (ICLR 2024)
2. There is no discussion on human text.
🤔 Other works report AUROC under the condition that the misclassification of human text as machine-authored (FPR in most of the literature, FNR in this paper's convention) is low. Considering FPR is critical: for a machine-generated text detector to be reliably deployed in real-life scenarios, human text should not be misclassified. Unlike other works, this paper reports AUC-ROC without any mention of FNR. This raises doubts about whether HMGC is truly as robust and applicable as the paper claims.
Observation: Previous works have shown that longer-span perturbations are more effective (e.g., Krishna et al., NeurIPS 2023 advocate perturbing 3 sentences at a time rather than 1). The strict constraints in HMGC leave no room for even a simple reordering within a sentence, nor for any phrase-level paraphrasing. It is interesting that using only word-level perturbations with a POS constraint achieves high attack success rates. Observing which tokens lead to certain detection results may provide valuable insights into the differences between human and machine-authored text.
MLM Syn performs extremely well considering the short time it requires. Why would that be?
What kind of training would be effective in making a strong detector?