Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)


Preserving the privacy of users is a key requirement of web-scale data mining applications and systems such as web search, recommender systems, crowdsourced platforms, and analytics applications, and has witnessed renewed focus in light of recent data breaches and new regulations such as GDPR. In this tutorial, we will first present an overview of privacy breaches over the last two decades and the lessons learned, key regulations and laws, and the evolution of privacy techniques leading to differential privacy. We will then focus on the application of privacy-preserving data mining techniques in practice through case studies such as Apple's differential privacy deployment for iOS / macOS, Google's RAPPOR, LinkedIn Salary and LinkedIn's PriPeARL framework for privacy-preserving analytics and reporting, and Microsoft's differential privacy deployment for collecting Windows telemetry. We will conclude with open problems and challenges for the data mining / machine learning community, based on our experiences in industry.


Krishnaram Kenthapadi (LinkedIn, USA)

Ilya Mironov (Google, USA)

Abhradeep Guha Thakurta (UC Santa Cruz, USA)

Tutorial Logistics

Venue/Time: 9:00am - 12:30pm on Monday, February 11, 2019 in Room 112

Tutorial Outline and Description

The tutorial will cover the following topics (please see the outline below for the depth within each topic).

  • Privacy breaches and lessons learned
  • Key privacy regulations and laws
  • Evolution of privacy techniques
  • Differential privacy: definition and techniques
  • Privacy techniques in practice: challenges and lessons learned
  • Apple’s differential privacy deployment for iOS
  • Google’s RAPPOR
  • LinkedIn Salary: Privacy Design
  • Open source privacy tools
  • Commercial tools
  • Open problems
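As a small taste of the "Differential privacy: definition and techniques" topic above, the following is a minimal illustrative sketch of the Laplace mechanism, the canonical technique for releasing noisy counts. This is a hypothetical example written for this page, not code from any of the deployments discussed.

```python
import numpy as np

def laplace_mechanism(true_count: int, sensitivity: float, epsilon: float) -> float:
    """Release a count with Laplace noise calibrated to sensitivity / epsilon.

    A counting query has sensitivity 1: adding or removing a single
    user's record changes the count by at most 1.
    """
    scale = sensitivity / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=scale)

# Example: release the number of users matching some predicate.
noisy = laplace_mechanism(true_count=1042, sensitivity=1.0, epsilon=0.5)
```

Smaller values of epsilon give stronger privacy at the cost of noisier answers; the noise is unbiased, so averaging many independent releases concentrates around the true count.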

The field of privacy-preserving data mining and machine learning is at an inflection point: there is a tremendous need from the perspective of web-scale applications and legal requirements, and practical, scalable privacy-preserving approaches are becoming available, as evidenced by adoption at companies such as Apple, Google, LinkedIn, and Uber. Given the renewed focus on privacy in light of recent data breaches and new regulations such as GDPR, this tutorial will be timely and of key relevance to the data mining research and practitioner community.

The tutorial presenters have extensive experience in the field of privacy-preserving data mining and machine learning, having contributed to both the theoretical foundations and the applications of privacy techniques in practice. The tutorial incorporates the lessons learned by the presenters while building and deploying privacy-preserving systems in practice (LinkedIn Salary, Google's RAPPOR, and Apple's differential privacy deployment for iOS, respectively), and combines a foundational overview (motivation for rigorous privacy techniques, key privacy regulations and laws, privacy definitions and techniques), case studies (early applications such as privacy in web search logs, and web-scale deployments at Apple, Google, and LinkedIn), and available tools to provide a substantial learning experience on this topic.

Outline (including list of key references)

The tutorial will be split into theory (background / privacy breaches / evolution of techniques / differential privacy / …) for the first half (1.5 hours) and practice (Apple's differential privacy deployment in iOS, Google's RAPPOR, LinkedIn Salary, etc.) for the second half (1.5 hours).
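The RAPPOR case study in the practice half builds on randomized response, which can be sketched as follows. This is a simplified illustration of the core primitive only; the actual RAPPOR protocol adds Bloom-filter encoding and two levels of randomization.

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Report the true bit with probability p; otherwise report its flip.

    Each report satisfies local differential privacy with
    epsilon = ln(p / (1 - p)).
    """
    return truth if random.random() < p else not truth

def estimate_true_fraction(reports, p: float = 0.75) -> float:
    """Unbiased estimate of the underlying fraction of true bits,
    inverting observed = f * p + (1 - f) * (1 - p)."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

# Simulate 50,000 users, roughly 30% of whom hold the sensitive bit.
random.seed(42)
reports = [randomized_response(random.random() < 0.3) for _ in range(50_000)]
estimate = estimate_true_fraction(reports)  # close to 0.3
```

No individual report reveals a user's true bit with certainty, yet the aggregator can still recover accurate population-level statistics.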

Presenter Bios

The presenters have extensive experience in the field of privacy-preserving data mining, spanning both the theory and the application of privacy techniques. They have published several well-cited papers on privacy. For instance, Krishnaram Kenthapadi and Ilya Mironov co-authored the second paper on differential privacy (EUROCRYPT 2006), which introduced the notion of (ε, δ)-differential privacy. Abhradeep Guha Thakurta's research on differential privacy and machine learning has appeared in premier venues such as STOC, FOCS, NIPS, KDD, and ICML. They also have rich experience applying privacy techniques in practice (LinkedIn Salary and LinkedIn's PriPeARL framework for privacy-preserving analytics and reporting, Google's RAPPOR, and Apple's differential privacy deployment for iOS and macOS, respectively), and have been providing technical leadership on AI and privacy at their respective organizations.
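For reference, the standard statement of the (ε, δ)-differential privacy notion mentioned above: a randomized mechanism M is (ε, δ)-differentially private if, for all pairs of datasets D and D' differing in a single record and all sets S of outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\epsilon} \cdot \Pr[M(D') \in S] + \delta
```

Setting δ = 0 recovers the original (pure) ε-differential privacy definition.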

Krishnaram Kenthapadi is part of the AI team at LinkedIn, where he leads the fairness, transparency, explainability, and privacy modeling efforts across different LinkedIn applications. He also serves as LinkedIn's representative on Microsoft's AI and Ethics in Engineering and Research (AETHER) Committee. He shaped the technical roadmap and led the privacy/modeling efforts for the LinkedIn Salary product, and prior to that served as the relevance lead for the LinkedIn Careers and Talent Solutions AI team, which powers search/recommendation products at the intersection of members, recruiters, and career opportunities. He also led the development of LinkedIn's PriPeARL system for privacy-preserving analytics. Previously, he was a Researcher at Microsoft Research Silicon Valley, where his work resulted in product impact (and Gold Star / Technology Transfer awards) and several publications/patents. He received his Ph.D. in Computer Science from Stanford University in 2006 for his thesis, "Models and Algorithms for Data Privacy". He serves regularly on the program committees of KDD, WWW, WSDM, and related conferences, and co-chaired the 2014 ACM Symposium on Computing for Development. He received the CIKM best case studies paper award, the SODA best student paper award, and a WWW best paper award nomination for his work on privacy-preserving publishing of web search logs. He has published 35+ papers with 2500+ citations, and has filed 130+ patents.

Ilya Mironov is a Staff Research Scientist at Google Brain, working on privacy in machine learning. Prior to joining Google, he was a Researcher at Microsoft Research Silicon Valley, where he co-authored seminal works on applications and implementations of differential privacy. He has been a keynote speaker at several recent events, including the Private and Secure Machine Learning workshop at ICML 2017, the Theory and Practice of Differential Privacy workshop co-located with ACM CCS 2017, and NordSec 2017. He graduated with a Ph.D. in Computer Science from Stanford in 2003.

Abhradeep Guha Thakurta is an Assistant Professor in the Department of Computer Science at the University of California, Santa Cruz. His primary research interest is in the intersection of data privacy and machine learning. He focuses on demonstrating, both theoretically and in practice, that privacy enables designing better machine learning algorithms, and vice versa. Prior to joining academia, Abhradeep was a Senior Machine Learning Scientist at Apple, where he was the primary algorithm designer for the deployment of local differential privacy on iOS 10 and macOS Sierra. This project is possibly one of the largest industrial deployments of differential privacy, and has resulted in 200+ press articles, a follow-up paper at NIPS 2017, and four granted patents. He won the 2016 Early Career Award from Pennsylvania State University for the same project. Before Apple, Abhradeep was a post-doctoral researcher at Microsoft Research Silicon Valley and Stanford University, and then a Research Scientist at Yahoo Research. He received his Ph.D. from Pennsylvania State University.