Implementation Details
Preprocessing & Feature Engineering
Each profile was parsed and normalized into structured sections; incomplete or malformed records were filtered out. Specific cleaning steps included:
Lowercasing, punctuation normalization, and whitespace trimming
Section separation: About, Experience, Education, Skills, etc.
Token-level noise removal: emojis, encoding glitches, HTML remnants
Default placeholders for missing but required fields
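The cleaning steps above can be sketched as follows. This is a minimal illustration, not the exact pipeline: the section names, the `[missing]` placeholder, and the emoji-stripping heuristic are assumptions for the example.

```python
import re
import unicodedata

REQUIRED_FIELDS = ("about", "experience", "education", "skills")  # assumed section names

def clean_text(text: str) -> str:
    """Lowercase, normalize punctuation and whitespace, strip emojis and HTML remnants."""
    text = unicodedata.normalize("NFKC", text)                    # repair encoding glitches
    text = re.sub(r"<[^>]+>", " ", text)                          # drop HTML remnants
    text = text.lower()
    text = re.sub(r"[\u2018\u2019]", "'", text)                   # curly -> straight quotes
    text = re.sub(r"[\u201c\u201d]", '"', text)
    text = "".join(ch for ch in text if ch.isascii() or ch.isalpha())  # drop emojis/symbols
    return re.sub(r"\s+", " ", text).strip()                      # whitespace trimming

def normalize_profile(record: dict) -> dict:
    """Clean each section and insert a default placeholder for missing required fields."""
    return {f: clean_text(record.get(f, "")) or "[missing]" for f in REQUIRED_FIELDS}
```

For instance, `clean_text("<b>Hello</b>   World!")` yields `"hello world!"`, and a record with no About section receives the `[missing]` placeholder.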
Feature vector:
Textual: Section-based embeddings → 512-d → reduced to 150-d via PCA
Numerical: 17 handcrafted features (e.g., #experiences, #skills, avg job duration)
Total dimensionality: 167
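Assembling the final vector is a straightforward concatenation of the two parts. A minimal sketch (the function name and shape checks are illustrative, not from the original pipeline):

```python
import numpy as np

def build_feature_vector(text_emb_150: np.ndarray, numeric_feats_17: np.ndarray) -> np.ndarray:
    """Concatenate the PCA-reduced section embedding (150-d) with the 17
    handcrafted numerical features (e.g., #experiences, #skills,
    average job duration) into the final 167-d feature vector."""
    assert text_emb_150.shape == (150,) and numeric_feats_17.shape == (17,)
    return np.concatenate([text_emb_150, numeric_feats_17])
```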
Embedding Models & Dimensionality Reduction
We evaluated six pre-trained embedding models. Four were retained based on explained variance and robustness:
RoBERTa (base) – high semantic coherence
ModernBERT – efficient for long, structured documents
DeBERTa-v3 – retained variance across both formal and informal bios
Flair – useful for short, noisy sections (e.g., Skills)
Each section (e.g., About) was embedded separately and then reduced via PCA to 150 components using the training set only.
This reduced LLM-specific redundancy and improved the generalization of the downstream classifier.
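The train-only PCA fit can be sketched with scikit-learn; the random matrices below stand in for the 512-d section embeddings and are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for 512-d section embeddings (e.g., the "About" section)
X_train = rng.normal(size=(400, 512))
X_test = rng.normal(size=(100, 512))

pca = PCA(n_components=150)
pca.fit(X_train)                     # fit on the training set only
X_train_150 = pca.transform(X_train)
X_test_150 = pca.transform(X_test)   # test data is only projected, never fit
```

Fitting PCA on the training split alone prevents information from the evaluation data from influencing the projection.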
Section Tag Embeddings (STE)
Instead of fine-grained subsection modeling, we adopted Section Tag Embeddings (STE):
Each section (e.g., About, Experience) was embedded as one block
Tag embeddings (e.g., the embedding of the string "About") were subtracted from the corresponding section text embeddings
All section vectors were then averaged into a single unified embedding
This approach filtered out formatting biases and emphasized semantic substance over section headers.
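The STE computation above reduces to a subtract-then-average over sections. A minimal sketch, assuming section and tag embeddings are precomputed NumPy vectors keyed by section name:

```python
import numpy as np

def section_tag_embedding(section_vecs: dict, tag_vecs: dict) -> np.ndarray:
    """STE: subtract each section tag's embedding (e.g., the embedding of
    the header string "About") from that section's text embedding, then
    average the differences across all sections into one unified vector."""
    diffs = [section_vecs[name] - tag_vecs[name] for name in section_vecs]
    return np.mean(diffs, axis=0)
```

Subtracting the tag embedding removes the contribution of the boilerplate header itself, so the averaged vector reflects section content rather than formatting.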
Classifier Selection & Hyperparameter Tuning
We evaluated six classifiers; only two passed our robustness and calibration thresholds:
XGBoost: Brier score = 0.0245
CatBoost: Brier score = 0.0240
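The Brier score is the mean squared difference between predicted probabilities and binary outcomes, so lower values indicate better-calibrated classifiers. A quick illustration with toy labels and probabilities (the data below is invented, not from our evaluation):

```python
from sklearn.metrics import brier_score_loss

y_true = [0, 1, 1, 0, 1]
y_prob = [0.1, 0.9, 0.8, 0.2, 0.7]  # predicted P(fake) from a classifier
score = brier_score_loss(y_true, y_prob)
print(round(score, 4))  # 0.038
```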
Optimization strategy:
Phase 1: Bayesian optimization (Tree Parzen Estimator) on a 70/30 split
Phase 2: Genetic algorithms sweep on top 15 configs (50 individuals × 3 generations)
Validation: 5-fold cross-validation on combined dev set
All tuning was done on LLPs and FLPs only (no LLM-generated fakes), so that hyperparameter selection could not exploit the adversarial test distribution.
Training Scenarios & Evaluation
We evaluate detection robustness under four progressively more adversarial training setups. For clarity, training and test splits are presented separately below.