Weak Links in LinkedIn

Datasets

Dataset Overview

This dataset supports our ASONAM 2025 study on the authenticity of LinkedIn profiles. It contains 4,200 profiles distributed as follows:

1,800 Legitimate LinkedIn Profiles (LLPs)
600 Manual Fake LinkedIn Profiles (FLPs)
1,200 GPT-3.5-Generated Profiles (GPT3.5Ps)
600 GPT-4-Generated Profiles (GPT4Ps)

It is the first dataset designed to test fake-profile detection models against three kinds of profile at once: legitimate profiles, manually created fakes, and LLM-generated adversarial examples.

Profile Categories

1. Legitimate LinkedIn Profiles (LLPs)

Count: 1,800
Source: Public LinkedIn profiles, visible without login
Includes: Name, Job Titles, Experience, Education, About, Skills, Followers

2. Manual Fake Profiles (FLPs)

Count: 600
Collected from: User reports, hashtag anomalies, fake org listings
Features: Content reuse, inflated roles, buzzwords, cloned bios

3. GPT-3.5 Fake Profiles (GPT3.5Ps)

Count: 1,200
Generated using zero-shot prompting
Sections include: Job Titles, Companies, Education, About
Quality: Reasonable structure, but less coherent than the GPT-4 profiles

4. GPT-4 Fake Profiles (GPT4Ps)

Count: 600
Generated using few-shot prompting, with LLPs as the in-context examples
Enforced diversity: Region, industry, role, org type
Includes all standard fields; highly realistic output

Data Structure and Features

Each profile includes both textual and numeric information:

Sections: About, Experience, Education
Subsections: Titles, companies, durations, institutions
Numerical features: Experience count, skill count, followers, connections

All profiles are consistently structured and formatted for model ingestion or manual inspection.
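
As a rough illustration of the layout described above, a single profile could be represented by a record like the one below. This is only a sketch: the class names, field names, and label strings are assumptions made for illustration, and the files in the repository may use a different schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical label strings, derived from the category abbreviations on this page.
LABELS = ("LLP", "FLP", "GPT3.5P", "GPT4P")


@dataclass
class ExperienceEntry:
    title: str
    company: str
    duration: str  # free text, e.g. "2019 - 2022"


@dataclass
class EducationEntry:
    institution: str


@dataclass
class Profile:
    label: str  # one of LABELS
    about: str = ""
    experience: List[ExperienceEntry] = field(default_factory=list)
    education: List[EducationEntry] = field(default_factory=list)
    skills: List[str] = field(default_factory=list)
    followers: int = 0
    connections: int = 0

    def numeric_features(self) -> dict:
        """Collect the numerical features listed above into one dict."""
        return {
            "experience_count": len(self.experience),
            "skill_count": len(self.skills),
            "followers": self.followers,
            "connections": self.connections,
        }


# Made-up example record, not taken from the dataset.
example = Profile(
    label="LLP",
    about="Data engineer.",
    experience=[ExperienceEntry("Engineer", "ExampleCorp", "2019 - 2022")],
    skills=["SQL", "Python"],
    followers=120,
    connections=300,
)
print(example.numeric_features())
```

Keeping the counts as derived values rather than stored fields avoids them drifting out of sync with the section contents.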

Similarity Metrics and Preprocessing

Cosine Similarity

We measured textual similarity between generated and legitimate profiles using cosine similarity.
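
A minimal sketch of such a comparison is given below. It assumes plain term-frequency vectors over lowercase word tokens; this vectorization is an assumption made for illustration and not necessarily the one used in the study (TF-IDF or sentence embeddings would be equally plausible). The function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Lowercase the text and count alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    a, b = term_counts(text_a), term_counts(text_b)
    # Counter returns 0 for tokens missing from b, so only shared tokens contribute.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)
```

With this definition, two About sections that share no tokens score 0, and identical texts score 1.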

Preprocessing Workflow

Cleaning: Normalized text and removed HTML artifacts
Parsing: Extracted structured sections and features
Validation: Removed profiles missing core fields (e.g., name, experience)
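
The three steps above can be sketched as follows. This is an illustrative approximation, not the preprocessing script from the repository: the input format (a dict per profile), the function names, and the choice of `name` and `experience` as the only core fields are assumptions.

```python
import html
import re

# Assumed core fields; the page names these two only as examples.
CORE_FIELDS = ("name", "experience")


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    decoded = html.unescape(no_tags)
    return re.sub(r"\s+", " ", decoded).strip()


def parse_profile(raw: dict) -> dict:
    """Clean the text fields and derive simple count features."""
    experience = [clean_text(e) for e in raw.get("experience", [])]
    skills = [clean_text(s) for s in raw.get("skills", [])]
    profile = {
        "name": clean_text(raw.get("name", "")),
        "about": clean_text(raw.get("about", "")),
        "experience": [e for e in experience if e],  # drop entries that cleaned to empty
        "skills": [s for s in skills if s],
    }
    profile["experience_count"] = len(profile["experience"])
    profile["skill_count"] = len(profile["skills"])
    return profile


def preprocess(raw_profiles: list) -> list:
    """Clean and parse every profile, then drop those missing a core field."""
    parsed = [parse_profile(r) for r in raw_profiles]
    return [p for p in parsed if all(p[f] for f in CORE_FIELDS)]
```

Validation runs after cleaning so that a field containing only markup or whitespace counts as missing.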

Access and Use Cases

GitHub Repository

The repository includes the full dataset, the profile generation prompts, the preprocessing scripts, and the evaluation tools.

Use Cases

Benchmarking fake profile detection systems
Training classifiers with synthetic adversaries
Studying LLM-generated deception patterns
Conducting human vs machine realism experiments
Auditing safety mechanisms on professional platforms
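
For the benchmarking use case, it may be informative to report accuracy separately for each profile category, since a detector could perform differently on manual fakes and on LLM-generated ones. The sketch below assumes hypothetical label strings and a detector that outputs a boolean "flagged as fake" decision per profile; it is not the evaluation tool shipped in the repository.

```python
from collections import defaultdict

# Assumed label strings; only "LLP" is treated as legitimate.
FAKE_LABELS = {"FLP", "GPT3.5P", "GPT4P"}


def per_category_accuracy(labels, flagged_as_fake):
    """Return {category: fraction of its profiles the detector classified correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, flagged in zip(labels, flagged_as_fake):
        total[label] += 1
        # A decision is correct when "flagged" matches whether the label is a fake category.
        if flagged == (label in FAKE_LABELS):
            correct[label] += 1
    return {label: correct[label] / total[label] for label in total}


# Made-up labels and detector outputs, for illustration only.
labels = ["LLP", "LLP", "FLP", "GPT4P", "GPT4P"]
flagged = [False, True, True, False, True]
print(per_category_accuracy(labels, flagged))
```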
