Implementation Details

Preprocessing & Feature Engineering

Each profile was parsed and normalized into structured sections; incomplete or malformed records were filtered out. Specific cleaning steps included:

Feature vector:

Total dimensionality: 167
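As a rough illustration of the parsing and filtering step, the sketch below normalizes raw profiles into named sections and drops incomplete or malformed records. The required-section set, the Profile container, and the helper names are illustrative assumptions, not the exact pipeline.

```python
from dataclasses import dataclass, field

REQUIRED_SECTIONS = {"about", "experience", "education"}  # assumed minimum

@dataclass
class Profile:
    sections: dict = field(default_factory=dict)  # section name -> raw text

def normalize(raw: dict) -> Profile:
    """Lowercase section keys and strip whitespace from non-empty bodies."""
    return Profile(sections={
        k.strip().lower(): v.strip()
        for k, v in raw.items()
        if isinstance(v, str) and v.strip()
    })

def is_well_formed(p: Profile) -> bool:
    """Treat a record as malformed unless all core sections are present."""
    return REQUIRED_SECTIONS.issubset(p.sections)

def preprocess(raw_records: list) -> list:
    """Parse, normalize, and filter raw profile records."""
    return [p for p in map(normalize, raw_records) if is_well_formed(p)]
```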

Embedding Models & Dimensionality Reduction

We evaluated six pre-trained transformers. Four were retained based on explained variance and robustness:

Each section (e.g., About) was embedded separately and then reduced via PCA to 150 components, with the PCA fit on the training set only to prevent test-set leakage.

This reduced LLM-specific redundancy and improved the generalization of the downstream classifier.
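To make the embed-then-reduce step concrete, here is a minimal sketch that embeds one section's texts and fits PCA on the training split only, so that test statistics never inform the projection. The sentence-transformer model name is a placeholder, not one of the four retained models.

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def fit_section_pca(train_texts, n_components=150):
    """Embed one section's training texts and fit PCA on them alone."""
    emb = model.encode(train_texts, convert_to_numpy=True)
    # Guard for toy inputs: PCA needs n_components <= min(samples, dims).
    k = min(n_components, emb.shape[0], emb.shape[1])
    pca = PCA(n_components=k).fit(emb)
    return pca, pca.transform(emb)

def reduce_section(pca, texts):
    """Project held-out (e.g., test) texts with the train-fit PCA."""
    return pca.transform(model.encode(texts, convert_to_numpy=True))
```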

Section Tag Embeddings (STE)

Instead of fine-grained subsection modeling, we adopted Section Tag Embeddings (STE):

This approach filtered out formatting biases and emphasized semantic substance over section headers.
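As a rough sketch of the idea, the snippet below collapses subsection structure and maps noisy raw headers onto a small set of canonical section tags before any embedding, so header formatting cannot leak into the representation. The tag vocabulary and alias map are assumptions for illustration only.

```python
CANONICAL_TAGS = {"about", "experience", "education", "skills"}  # assumed

HEADER_ALIASES = {  # assumed mapping for noisy header strings
    "summary": "about",
    "work history": "experience",
    "studies": "education",
}

def to_tag(raw_header: str):
    """Map a raw header to a canonical tag, or None to drop it."""
    h = raw_header.strip().lower()
    return h if h in CANONICAL_TAGS else HEADER_ALIASES.get(h)

def group_by_tag(profile_sections: dict) -> dict:
    """Collapse subsections: pool all text that maps to the same tag."""
    grouped = {}
    for header, text in profile_sections.items():
        tag = to_tag(header)
        if tag is not None:
            grouped.setdefault(tag, []).append(text)
    return {tag: " ".join(parts) for tag, parts in grouped.items()}
```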

Classifier Selection & Hyperparameter Tuning

We evaluated six classifiers; only two passed our robustness and calibration thresholds:

Optimization strategy:

All tuning was performed on LLPs and FLPs only (no LLM-generated fakes), so that reported generalization to LLM-generated profiles is not inflated by hyperparameter selection.
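A minimal sketch of this protocol follows, assuming a cross-validated grid search with a Brier-score calibration gate; the candidate models, grids, and threshold are placeholders rather than the two classifiers that actually passed our thresholds.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import GridSearchCV

CANDIDATES = {  # placeholder models and grids
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "rf": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
}

def tune(X_train, y_train, X_val, y_val, max_brier=0.20):
    """Tune on human-written LLP/FLP data only; gate on calibration."""
    kept = {}
    for name, (model, grid) in CANDIDATES.items():
        search = GridSearchCV(model, grid, cv=5, scoring="roc_auc")
        search.fit(X_train, y_train)  # no LLM-generated fakes in here
        proba = search.best_estimator_.predict_proba(X_val)[:, 1]
        if brier_score_loss(y_val, proba) <= max_brier:  # calibration gate
            kept[name] = search.best_estimator_
    return kept
```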

Training Scenarios & Evaluation

We evaluated detection robustness under four progressively more adversarial training setups. For clarity, training and test splits are presented separately below.
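Purely as an illustration of how such setups could be assembled (the actual compositions are specified in the splits presented below), this sketch builds four training pools with increasing amounts of LLM-generated data; the mixing ratios are assumptions.

```python
def make_training_scenarios(llp, flp, llm_fakes):
    """Return four training pools of increasing adversarial difficulty."""
    return {
        "S1_human_only": llp + flp,                                  # no LLM fakes
        "S2_few_llm":    llp + flp + llm_fakes[: len(llm_fakes) // 10],
        "S3_half_llm":   llp + flp + llm_fakes[: len(llm_fakes) // 2],
        "S4_all_llm":    llp + flp + llm_fakes,
    }
```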