Abstract
Synthetic data is often positioned as a solution to replace sensitive fixed-size data sets with a source of unlimited matching data, freed from privacy concerns. There has been much progress in synthetic data generation over the last decade, leveraging corresponding advances in machine learning and data analytics. In this tutorial, we survey the key developments and the main concepts in tabular synthetic data generation, including paradigms based on probabilistic graphical models and on deep learning. We provide background and motivation, before giving a technical deep-dive into the methodologies. We also address the limitations of synthetic data by studying attacks that seek to retrieve information about the original sensitive data. Finally, we present extensions and open problems in this area.
Schedule
We will have a 15-minute coffee break and two short 5-minute breaks. Approximate breakdown of material (details in the survey paper):
Introduction to Synthetic Data, Background & Preliminaries
Synthetic Data as a Privacy Enhancing Technology
Desiderata for Synthetic Data
Social and Legal Reasons for using Synthetic Data
Marginal Based Methods
Early Statistical Approaches -- Data Swapping & SMOTE
Probabilistic Graphical Models
Early Marginal Based Approaches -- Naïve Bayes & MWEM
PrivBayes: A Middle Way
Richer Models -- Private-PGM, MST, PrivMRF, PrivSyn
State-of-the-Art -- Select-Measure-Generate Paradigm, RAP++ & AIM
Deep Learning Based Methods
Generative Adversarial Networks -- CTGAN, TVAE etc.
Extensions to GANs -- PATE-GAN, GEM
Recent Advances -- Diffusion-based Models
Deep Learning vs Marginal-based Methods
Privacy Attacks & Defenses on Synthetic Data
Privacy Analysis Setup -- Threat Models & Privacy Risk Measurement
Techniques for Privacy Attacks & Risk Analysis
Advanced Topics & Open Problems
Additional Modalities of Synthetic Data -- Graphs, Text, Images & Videos
Distributed Synthetic Data Generation
Pragmatic Privacy Considerations in Differentially Private Synthetic Data
Synthetic Data Generation Frameworks
Reference Materials
This webpage is a companion to the tutorial presented at KDD 2025 and VLDB 2025. We intend to share tutorial materials here, including slides, videos (if any), the survey paper, and links to references and libraries. More details will be posted in the coming weeks.
Survey Paper: link
Organizers
Graham Cormode
Research Scientist at Meta & Professor at the University of Warwick
🌐 : webpage
🎓 : google scholar
Samuel Maddock
Research Scientist Intern at Meta & PhD Candidate at the University of Warwick
🌐 : webpage
🎓 : google scholar