Datasets
Dataset Overview
This dataset supports our ASONAM 2025 study on the authenticity of LinkedIn profiles. It contains 4,200 profiles distributed as follows:
1,800 Legitimate LinkedIn Profiles (LLPs)
600 Manual Fake LinkedIn Profiles (FLPs)
1,200 GPT-3.5-Generated Profiles (GPT3.5Ps)
600 GPT-4-Generated Profiles (GPT4Ps)
It is the first dataset designed to test fake-profile detection models against legitimate profiles, manually created fakes, and LLM-generated adversarial examples.
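As a quick sanity check, the class breakdown above can be tallied in a few lines of Python. This is only a sketch: the dictionary keys reuse the abbreviations from this section, while the variable names (CLASS_COUNTS, shares, fake_total) are illustrative and not part of the dataset release.

```python
# Class sizes as stated in the overview (labels follow the abbreviations above).
CLASS_COUNTS = {
    "LLP": 1800,      # Legitimate LinkedIn Profiles
    "FLP": 600,       # Manual Fake LinkedIn Profiles
    "GPT3.5P": 1200,  # GPT-3.5-generated profiles
    "GPT4P": 600,     # GPT-4-generated profiles
}

# Total number of profiles across all four classes.
total = sum(CLASS_COUNTS.values())

# Share of each class in the full dataset.
shares = {label: count / total for label, count in CLASS_COUNTS.items()}

# Profiles that are not legitimate (manual fakes plus LLM-generated ones).
fake_total = total - CLASS_COUNTS["LLP"]

if __name__ == "__main__":
    print(total, fake_total)
    for label, share in shares.items():
        print(f"{label}: {share:.1%}")
```

The four classes sum to the stated 4,200 profiles, of which 2,400 are fake in some form, so legitimate profiles make up roughly 43% of the data. A binary real-vs-fake classifier trained on the full set would therefore see a moderately imbalanced label distribution.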
Profile Categories
1. Legitimate LinkedIn Profiles (LLPs)
Count: 1,800
Source: Public LinkedIn profiles, visible without login
Includes: Name, Job Titles, Experience, Education, About, Skills, Followers
2. Manual Fake LinkedIn Profiles (FLPs)
Count: 600
Collected from: User reports, hashtag anomalies, fake org listings
Features: Content reuse, inflated roles, buzzwords, cloned bios
3. GPT-3.5-Generated Profiles (GPT3.5Ps)
Count: 1,200
Generated using zero-shot prompting
Sections include: Job Titles, Companies, Education, About
Quality: Reasonable structure, but less coherent than the GPT-4 profiles
4. GPT-4-Generated Profiles (GPT4Ps)
Count: 600
Generated using few-shot prompting with real LLPs
Enforced diversity: Region, industry, role, org type
Includes all standard fields; highly realistic output
Data Structure and Features
Each profile includes both textual and numeric information:
Sections: About, Experience, Education
Subsections: Titles, companies, durations, institutions
Numerical features: Experience count, skill count, followers, connections
All profiles are consistently structured and formatted for model ingestion or manual inspection.
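The structure described above could be represented in Python roughly as follows. This is a minimal sketch under assumptions: the class and field names (Profile, ExperienceEntry, EducationEntry, numeric_features) are hypothetical, the actual file format and column names of the release are not specified here, and the example record is made-up placeholder data rather than a profile from the dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExperienceEntry:
    title: str
    company: str
    duration: str  # free text; the real duration format is an assumption


@dataclass
class EducationEntry:
    institution: str


@dataclass
class Profile:
    label: str  # one of "LLP", "FLP", "GPT3.5P", "GPT4P"
    about: str = ""
    experience: List[ExperienceEntry] = field(default_factory=list)
    education: List[EducationEntry] = field(default_factory=list)
    skills: List[str] = field(default_factory=list)
    followers: int = 0
    connections: int = 0

    def numeric_features(self) -> Dict[str, int]:
        """Return the numerical features listed in this section."""
        return {
            "experience_count": len(self.experience),
            "skill_count": len(self.skills),
            "followers": self.followers,
            "connections": self.connections,
        }


# Placeholder record for illustration only (not taken from the dataset).
example = Profile(
    label="LLP",
    about="Placeholder summary text.",
    experience=[ExperienceEntry("Engineer", "Example Co", "2 yrs")],
    education=[EducationEntry("Example University")],
    skills=["python", "sql"],
    followers=120,
    connections=300,
)

if __name__ == "__main__":
    print(example.numeric_features())
```

Default values are used so that a record missing some fields still loads; for instance, the GPT-3.5 profiles are described as containing only job titles, companies, education, and an About section.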
Similarity Metrics and Preprocessing
Cosine Similarity
We measured textual similarity between generated and legitimate profiles: