5. Feature Insights & Calibration
We conducted ablation studies to understand the relative contribution of each feature type and tested whether better calibration improves robustness to LLM-generated attacks.
Ablation: Feature Type vs. Robustness
STE Text Embeddings Only: High performance on manually created fakes (F1 ≈ 96%), but F1 dropped to 57–81% on LLM-generated fakes.
Numerical Features Only: Moderate baseline performance (F1 ≈ 95%), but more resilient to LLM attacks (F1 ≈ 78–80%).
Combined (Text + Numeric): Highest performance across all test cases, retaining F1 > 97% even under combined attack scenarios.
These results suggest that while LLMs are strong at mimicking human writing, they struggle to emulate platform-level behavioral signals (e.g., skill counts, endorsement ratios, connection networks). Numerical features captured those inconsistencies.
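To make the combined setup concrete, the sketch below pairs off-the-shelf sentence embeddings with the kinds of numeric signals listed above and trains a single classifier on their concatenation. The encoder, profile field names, and classifier are illustrative assumptions standing in for the STE embeddings and feature pipeline, not a reproduction of them.

```python
# A hedged sketch of the text + numeric feature combination; the encoder,
# profile field names, and classifier are assumptions, not the exact pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer  # stand-in for STE
from sklearn.ensemble import GradientBoostingClassifier

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_features(profiles):
    """Concatenate text embeddings with platform-level behavioral signals."""
    # "summary", "skill_count", etc. are hypothetical profile fields.
    text_emb = encoder.encode([p["summary"] for p in profiles])
    numeric = np.array(
        [[p["skill_count"], p["endorsement_ratio"], p["connection_count"]]
         for p in profiles],
        dtype=float,
    )
    # Behavioral columns sit alongside the embedding dimensions, so the
    # classifier can exploit inconsistencies an LLM-written bio cannot hide.
    return np.hstack([text_emb, numeric])

# clf = GradientBoostingClassifier().fit(build_features(train_profiles), labels)
```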
Calibration Effects
We evaluated model calibration using the Brier score, i.e., the mean squared difference between predicted probabilities and actual outcomes (lower is better). Well-calibrated models made sharper, more reliable predictions and suffered lower false-accept rates.
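The metric itself is a one-liner with scikit-learn; the labels and probabilities below are toy values for illustration only.

```python
# Computing the Brier score with scikit-learn's brier_score_loss.
from sklearn.metrics import brier_score_loss

y_true = [1, 0, 1, 1, 0]            # 1 = fake profile, 0 = legitimate
p_fake = [0.9, 0.2, 0.7, 0.6, 0.1]  # model's predicted P(fake)

# Mean squared difference between predicted probability and outcome;
# 0 is perfect, and lower values indicate better calibration.
print(brier_score_loss(y_true, p_fake))
```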
Calibration complements feature design: even when the feature set is strong, a poorly calibrated model can overtrust its predictions. Our results show that both matter.
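As a minimal, self-contained illustration of post-hoc calibration, the sketch below applies Platt scaling via scikit-learn's CalibratedClassifierCV and compares Brier scores before and after. The synthetic data and gradient-boosted base model are assumptions that stand in for our profile features and detector; the sketch shows the mechanics, not our exact procedure.

```python
# Post-hoc calibration via Platt scaling; synthetic data stands in for
# the profile features, so this illustrates the mechanics only.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = GradientBoostingClassifier().fit(X_tr, y_tr)

# CalibratedClassifierCV refits clones of the base model on CV folds and
# learns a sigmoid mapping from raw scores to calibrated probabilities.
cal = CalibratedClassifierCV(GradientBoostingClassifier(),
                             method="sigmoid", cv=5)
cal.fit(X_tr, y_tr)

p_raw = base.predict_proba(X_te)[:, 1]
p_cal = cal.predict_proba(X_te)[:, 1]
print(f"Brier (raw):        {brier_score_loss(y_te, p_raw):.4f}")
print(f"Brier (calibrated): {brier_score_loss(y_te, p_cal):.4f}")
```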