Implementation Details
Preprocessing & Feature Engineering
Each profile was parsed and normalized into structured sections; incomplete or malformed records were filtered out. Specific cleaning steps included:
Lowercasing, punctuation normalization, and whitespace trimming
Section separation: About, Experience, Education, Skills, etc.
Token-level noise removal: emojis, encoding glitches, HTML remnants
Default placeholders for missing but required fields
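The cleaning steps above can be sketched as follows. This is a minimal illustration, not the exact pipeline: the section names, the `[missing]` placeholder, and the emoji-stripping heuristic are assumptions for the example.

```python
import re
import unicodedata

REQUIRED_FIELDS = ("about", "experience", "education", "skills")  # assumed section names

def clean_text(text: str) -> str:
    """Lowercase, normalize punctuation and whitespace, strip emojis and HTML remnants."""
    text = unicodedata.normalize("NFKC", text)                    # repair encoding glitches
    text = re.sub(r"<[^>]+>", " ", text)                          # drop HTML remnants
    text = text.lower()
    text = re.sub(r"[\u2018\u2019]", "'", text)                   # curly -> straight quotes
    text = re.sub(r"[\u201c\u201d]", '"', text)
    text = "".join(ch for ch in text if ch.isascii() or ch.isalpha())  # drop emojis/symbols
    return re.sub(r"\s+", " ", text).strip()                      # whitespace trimming

def normalize_profile(record: dict) -> dict:
    """Clean each section and insert a default placeholder for missing required fields."""
    return {f: clean_text(record.get(f, "")) or "[missing]" for f in REQUIRED_FIELDS}
```

For instance, `clean_text("<b>Hello</b>   World!")` yields `"hello world!"`, and a record with no About section receives the `[missing]` placeholder.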
Feature vector:
Textual: Section-based embeddings → 512-d → reduced to 150-d via PCA
Numerical: 17 handcrafted features (e.g., #experiences, #skills, avg job duration)
Total dimensionality: 167
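Assembling the final vector is a straightforward concatenation of the two parts. A minimal sketch (the function name and shape checks are illustrative, not from the original pipeline):

```python
import numpy as np

def build_feature_vector(text_emb_150: np.ndarray, numeric_feats_17: np.ndarray) -> np.ndarray:
    """Concatenate the PCA-reduced section embedding (150-d) with the 17
    handcrafted numerical features (e.g., #experiences, #skills,
    average job duration) into the final 167-d feature vector."""
    assert text_emb_150.shape == (150,) and numeric_feats_17.shape == (17,)
    return np.concatenate([text_emb_150, numeric_feats_17])
```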
Embedding Models & Dimensionality Reduction
We evaluated six pre-trained embedding models. Four were retained based on explained variance and robustness:
RoBERTa (base) – high semantic coherence
ModernBERT – efficient for long, structured documents
DeBERTa-v3 – retained variance across both formal and informal bios
Flair – useful for short, noisy sections (e.g., Skills)
Each section (e.g., About) was embedded separately and then reduced via PCA to 150 components using the training set only.
This reduced LLM-specific redundancy and improved the generalization of the downstream classifier.
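The train-only PCA fit can be sketched with scikit-learn; the random matrices below stand in for the 512-d section embeddings and are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for 512-d section embeddings (e.g., the "About" section)
X_train = rng.normal(size=(400, 512))
X_test = rng.normal(size=(100, 512))

pca = PCA(n_components=150)
pca.fit(X_train)                     # fit on the training set only
X_train_150 = pca.transform(X_train)
X_test_150 = pca.transform(X_test)   # test data is only projected, never fit
```

Fitting PCA on the training split alone prevents information from the evaluation data from influencing the projection.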
Section Tag Embeddings (STE)
Instead of fine-grained subsection modeling, we adopted Section Tag Embeddings (STE):
Each section (e.g., About, Experience) was embedded as one block
Tag embeddings (e.g., the embedding of the string "About") were subtracted from the corresponding section text embeddings
All section vectors were then averaged into a single unified embedding
This approach filtered out formatting biases and emphasized semantic substance over section headers.
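The STE computation above reduces to a subtract-then-average over sections. A minimal sketch, assuming section and tag embeddings are precomputed NumPy vectors keyed by section name:

```python
import numpy as np

def section_tag_embedding(section_vecs: dict, tag_vecs: dict) -> np.ndarray:
    """STE: subtract each section tag's embedding (e.g., the embedding of
    the header string "About") from that section's text embedding, then
    average the differences across all sections into one unified vector."""
    diffs = [section_vecs[name] - tag_vecs[name] for name in section_vecs]
    return np.mean(diffs, axis=0)
```

Subtracting the tag embedding removes the contribution of the boilerplate header itself, so the averaged vector reflects section content rather than formatting.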
Classifier Selection & Hyperparameter Tuning
We evaluated six classifiers; only two passed our robustness and calibration thresholds:
XGBoost: Brier score = 0.0245
CatBoost: Brier score = 0.0240
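The Brier score is the mean squared difference between predicted probabilities and binary outcomes, so lower values indicate better-calibrated classifiers. A quick illustration with toy labels and probabilities (the data below is invented, not from our evaluation):

```python
from sklearn.metrics import brier_score_loss

y_true = [0, 1, 1, 0, 1]
y_prob = [0.1, 0.9, 0.8, 0.2, 0.7]  # predicted P(fake) from a classifier
score = brier_score_loss(y_true, y_prob)
print(round(score, 4))  # 0.038
```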
Optimization strategy:
Phase 1: Bayesian optimization (Tree Parzen Estimator) on a 70/30 split
Phase 2: Genetic algorithms sweep on top 15 configs (50 individuals × 3 generations)
Validation: 5-fold cross-validation on combined dev set
All tuning was done on LLPs and FLPs only (no LLM-generated fakes), so that hyperparameter selection could not exploit the adversarial test distribution.
Training Scenarios & Evaluation
We evaluate detection robustness under four progressively more adversarial training setups. For clarity, training and test splits are presented separately below.