Abstract
Synthetic data is often positioned as a solution to replace sensitive fixed-size data sets with a source of unlimited matching data, freed from privacy concerns. There has been much progress in synthetic data generation over the last decade, leveraging corresponding advances in machine learning and data analytics. In this tutorial, we survey the key developments and the main concepts in tabular synthetic data generation, including paradigms based on probabilistic graphical models and on deep learning. We provide background and motivation before giving a technical deep dive into the methodologies. We also address the limitations of synthetic data by studying attacks that seek to retrieve information about the original sensitive data. Finally, we present extensions and open problems in this area.
Schedule
We will have one 15-minute coffee break and three short 5-minute breaks. Approximate breakdown of the material (details are in the survey paper):
Introduction to Synthetic Data, Background & Preliminaries
[Small 5 min Break]
Marginal-Based Methods
[Small 5 min Break]
Deep Learning-Based Methods
[15 min Break]
Privacy Attacks & Defenses on Synthetic Data
[Small 5 min Break]
Advanced Topics / Extensions & Open Problems
[Q&A]
Reference Materials
This webpage is a companion to the tutorial presented at KDD 2025 and VLDB 2025. We intend to share tutorial materials here, including slides, videos (if any), the survey paper, and links to references and libraries. More details will be posted in the coming weeks.
Survey Paper: link
Slides
Organizers
Graham Cormode
Research Scientist at Meta & Professor at the University of Warwick
🌐 : webpage
🎓 : google scholar
Samuel Maddock
Research Scientist Intern at Meta & PhD Candidate at the University of Warwick
🌐 : webpage
🎓 : google scholar