Privacy-preserving Data Mining in Industry (WSDM 2019 Tutorial)
Overview
Preserving the privacy of users is a key requirement of web-scale data mining applications and systems such as web search, recommender systems, crowdsourced platforms, and analytics applications, and it has received renewed focus in light of recent data breaches and new regulations such as GDPR. In this tutorial, we will first present an overview of privacy breaches over the last two decades and the lessons learned, key regulations and laws, and the evolution of privacy techniques leading to the definition and techniques of differential privacy. Then, we will focus on the application of privacy-preserving data mining techniques in practice by presenting case studies such as Apple's differential privacy deployment for iOS / macOS, Google's RAPPOR, LinkedIn Salary and LinkedIn's PriPeARL framework for privacy-preserving analytics and reporting, and Microsoft's differential privacy deployment for collecting Windows telemetry. We will conclude with open problems and challenges for the data mining / machine learning community, based on our experiences in industry.
Presenters
Krishnaram Kenthapadi (LinkedIn, USA)
Ilya Mironov (Google, USA)
Abhradeep Guha Thakurta (UC Santa Cruz, USA)
Tutorial Material
Tutorial Context: Fairness, Privacy, and Transparency by Design in AI/ML Systems
Slides (embedded below)
Tutorial Logistics
Venue/Time: 9:00am - 12:30pm on Monday, February 11, 2019 in Room 112
Tutorial Outline and Description
The tutorial will cover the following topics (please see the outline below for the depth within each topic).
- Privacy breaches and lessons learned
- Key privacy regulations and laws
- Evolution of privacy techniques
- Differential privacy: definition and techniques
- Privacy techniques in practice: Challenges and Lessons Learned
- Apple’s differential privacy deployment for iOS
- Google’s RAPPOR
- LinkedIn Salary: Privacy Design
- Open source privacy tools
- Commercial tools
- Open problems
The field of privacy-preserving data mining and machine learning is at an inflection point: there is a tremendous need from the perspective of web-scale applications and legal requirements, and, at the same time, practical and scalable privacy-preserving approaches are becoming available, as evidenced by their adoption at companies such as Apple, Google, LinkedIn, and Uber. Given the renewed focus on privacy in light of recent data breaches and new regulations such as GDPR, this tutorial will be timely and of key relevance to the data mining research and practitioner community.
The tutorial presenters have extensive experience in the field of privacy-preserving data mining and machine learning, having contributed to both the theoretical foundations and the application of privacy techniques in practice. The tutorial incorporates the lessons learned by the presenters while building and deploying privacy-preserving systems in practice (LinkedIn Salary, Google's RAPPOR, and Apple's differential privacy deployment for iOS, respectively), and combines a foundational overview (motivation for rigorous privacy techniques, key privacy regulations and laws, privacy definitions and techniques), case studies (early applications such as privacy in web search logs, and web-scale deployments at Apple, Google, and LinkedIn), and available tools, to provide a substantial learning experience on this topic.
Outline (including list of key references)
The tutorial will be split into theory (background / privacy breaches / evolution of techniques / differential privacy / …) for the first half (1.5 hours) and practice (Apple's differential privacy deployment in iOS, Google's RAPPOR, LinkedIn Salary, etc.) for the second half (1.5 hours).
- Privacy breaches and lessons learned
- Re-identification by Linking: Medical records
- L. Sweeney. k-Anonymity: A Model for Protecting Privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 2002, 557-570.
- Re-identification by Linking: AOL search log
- De-anonymizing Social Networks
- Narayanan, Arvind, and Vitaly Shmatikov. "De-anonymizing social networks." 30th IEEE Symposium on Security and Privacy, 2009.
- De-anonymizing Netflix Data
- Narayanan, Arvind, and Vitaly Shmatikov. "Robust de-anonymization of large sparse datasets." IEEE Symposium on Security and Privacy, 2008.
- Privacy violations using microtargeted ads
- Korolova, Aleksandra. "Privacy Violations Using Microtargeted Ads: A Case Study." Journal of Privacy and Confidentiality 3.1 (2011): 27-49.
- De-anonymizing Web Browsing Data with Social Networks
- Key privacy regulations and laws
- GDPR (General Data Protection Regulation), which took effect in May 2018:
- An opinion from EU regulators about anonymity: http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf
- IDPC privacy regulation
- HIPAA
- Evolution of privacy techniques
- Differential privacy: definition and techniques (the core definitions are sketched after this outline)
- Original definitions
- \epsilon-differential privacy
- Dwork, C., McSherry, F., Nissim, K., & Smith, A. Calibrating noise to sensitivity in private data analysis. TCC, 2006.
- (\epsilon, \delta)-differential privacy
- Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., & Naor, M. Our Data, Ourselves: Privacy Via Distributed Noise Generation. EUROCRYPT, 2006.
- Extensions / variants for DP over time, local sensitivity, elastic sensitivity, Rényi differential privacy, etc.
- Dwork, Cynthia, and Aaron Roth. "The algorithmic foundations of differential privacy." Foundations and Trends in Theoretical Computer Science 9.3–4 (2014): 211-407.
- Initial applications on web-scale datasets:
- Privacy for search logs
- Korolova, A., Kenthapadi, K., Mishra, N., & Ntoulas, A. Releasing search queries and clicks privately. WWW 2009.
- PINQ
- McSherry, Frank D. "Privacy integrated queries: an extensible platform for privacy-preserving data analysis." SIGMOD 2009.
- Privacy techniques in practice: Challenges and Lessons Learned
- Balancing the trade-offs between utility, privacy, and resources
- Utility-privacy trade-offs have been studied in the literature, but handling trade-offs involving computation and communication resources is unique to industrial deployments.
- Current industrial deployments are highly distributed
- Each device (e.g., a cell phone) holds a single data point, and the server computes aggregates over the differentially private reports sent by the devices (a minimal randomized-response sketch of this setup appears after this outline)
- Server computation is cheap, but client computation is expensive since the clients are low-power devices
- Client / server communication is expensive, and hence both the amount of communication and the number of communication rounds have to be minimized
- Lessons learned: Resource constraints drive the algorithmic design as much as the privacy-utility trade-offs. In fact, these constraints often lead to the design of theoretically optimal algorithms (Practical locally private heavy hitters, https://arxiv.org/abs/1707.04982).
- Bassily, Raef, Uri Stemmer, and Abhradeep Guha Thakurta. "Practical locally private heavy hitters." NIPS 2017.
- Apple’s differential privacy deployment for iOS
- Differential privacy deployment for iOS (Apple; https://images.apple.com/privacy/docs/Differential_Privacy_Overview.pdf)
- Algorithmic description: Patents (https://www.google.com/patents/US9594741, https://www.google.com/patents/US9705908), follow-up paper at NIPS 2017 (https://arxiv.org/abs/1707.04982) and at https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf.
- Optimal algorithm (in terms of error, storage, computation, and communication) for locally differentially private heavy hitters.
- Bassily, Raef, Uri Stemmer, and Abhradeep Guha Thakurta. "Practical locally private heavy hitters." NIPS 2017.
- Runs on all iOS and macOS devices. The technology has been used for learning new words from user keyboards, health analytics, device telemetry, etc.
- Google’s RAPPOR
- https://security.googleblog.com/2014/10/learning-statistics-with-privacy-aided.html
- Erlingsson, Úlfar, Vasyl Pihur, and Aleksandra Korolova. "RAPPOR: Randomized aggregatable privacy-preserving ordinal response." ACM CCS, 2014.
- https://github.com/google/rappor
- LinkedIn Salary
- https://www.linkedin.com/salary
- Kenthapadi, Krishnaram, Ahsan Chudhary, and Stuart Ambler. "LinkedIn Salary: A System for Secure Collection and Presentation of Structured Compensation Insights to Job Seekers." IEEE PAC 2017 (https://arxiv.org/abs/1705.06976)
- Challenges in applying differential privacy based techniques
- Kenthapadi, K., Ambler, S., Zhang, L., and Agarwal, D. Bringing Salary Transparency to the World: Computing Robust Compensation Insights via LinkedIn Salary. ACM CIKM, 2017 (best case studies paper award; https://arxiv.org/abs/1703.09845)
- LinkedIn's PriPeARL framework for privacy-preserving analytics and reporting
- Krishnaram Kenthapadi, Thanh T. L. Tran, PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn, ACM CIKM 2018
- Open source privacy tools
- Google's open source project RAPPOR for privacy-preserving data collection: https://github.com/google/rappor
- SQL Differential Privacy tool recently open-sourced by Uber: https://github.com/uber/sql-differential-privacy
- "Elastic sensitivity is an approach for efficiently approximating the local sensitivity of a query, which can be used to enforce differential privacy for the query. The approach requires only a static analysis of the query and therefore imposes minimal performance overhead. Importantly, it does not require any changes to the database. Details of the approach are available in the following paper: Johnson, N., Near, J.P. and Song, D., Towards Practical Differential Privacy for SQL Queries. https://arxiv.org/abs/1706.09479".
- "The framework for implementing elastic sensitivity is designed to perform dataflow analyses over complex SQL queries. It provides an abstract representation of queries, plus several kinds of built-in dataflow analyses tailored to this representation. This framework can be used to implement other types of dataflow analyses and will soon support additional differential privacy mechanisms for SQL."
- Andy Greenberg, Uber's New Tool Lets Its Staff Know Less About You, Wired (July 2017). https://www.wired.com/story/uber-privacy-elastic-sensitivity
- Commercial tools
- Aircloak (https://www.aircloak.com):
- https://www.aircloak.com/downloads/Aircloak-One-Pager.pdf
- Aircloak is an analytics anonymization product designed for SQL queries. It examines each SQL query, identifies the filter condition(s), and either adds noise based on the filter condition(s) in the case of aggregate queries, or chooses to suppress the query result in case the number of rows in the result is below a threshold. The underlying database itself is not modified, since the noise is added to the result of each query by Aircloak (in other words, Aircloak uses "output perturbation"; a minimal sketch of this pattern appears after this outline). Under the hood, the query is rewritten by Aircloak. However, only a subset of SQL constructs is currently supported (the most common ones, such as select, min, max, count, average, where, and joins, are supported; more constructs may be supported in future releases).
- Open problems
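For readers who want the precise statements behind the "Differential privacy: definition and techniques" items above, here is a brief sketch of the standard definitions from the two 2006 papers cited there, together with the Laplace mechanism. The notation (mechanism M, neighboring datasets D, D') is the standard one from the literature, not material taken from the tutorial slides.

```latex
% \epsilon-differential privacy (Dwork, McSherry, Nissim & Smith, TCC 2006):
% a randomized mechanism M is \epsilon-differentially private if, for all
% neighboring datasets D, D' (differing in a single record) and all
% S \subseteq \mathrm{Range}(M),
\Pr[M(D) \in S] \le e^{\epsilon} \cdot \Pr[M(D') \in S].

% (\epsilon, \delta)-differential privacy (Dwork, Kenthapadi, McSherry,
% Mironov & Naor, EUROCRYPT 2006) relaxes this with an additive slack \delta:
\Pr[M(D) \in S] \le e^{\epsilon} \cdot \Pr[M(D') \in S] + \delta.

% The Laplace mechanism achieves \epsilon-DP for a numeric query f with
% global sensitivity \Delta f = \max_{D \sim D'} \lVert f(D) - f(D') \rVert_1:
M(D) = f(D) + \mathrm{Lap}\!\left(\Delta f / \epsilon\right).
```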
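The distributed, local-model deployments discussed above (Apple's iOS deployment and Google's RAPPOR) both build on randomized response: each device randomizes its own report before it leaves the device, and the server debiases the aggregated counts. The sketch below is basic k-ary randomized response, not either production algorithm (RAPPOR adds Bloom-filter encoding with permanent and instantaneous randomization, and Apple's system uses sketching); the function names are purely illustrative.

```python
import math
import random
from collections import Counter

def client_report(true_value, domain, epsilon):
    """k-ary randomized response: with probability e^eps / (e^eps + k - 1)
    report the true value, otherwise report one of the other k - 1 values
    uniformly at random. This satisfies epsilon-local differential privacy."""
    k = len(domain)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p_true:
        return true_value
    return random.choice([v for v in domain if v != true_value])

def server_estimate(reports, domain, epsilon):
    """Debias the noisy counts to estimate how many devices hold each value."""
    n = len(reports)
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)  # prob. of reporting the truth
    q = 1.0 / (math.exp(epsilon) + k - 1)                # prob. of any specific other value
    counts = Counter(reports)
    # E[count_v] = n_v * p + (n - n_v) * q, so solve for n_v.
    return {v: (counts[v] - n * q) / (p - q) for v in domain}

# Toy usage: 10,000 devices, each holding a single categorical value.
domain = ["A", "B", "C", "D"]
truth = [random.choices(domain, weights=[4, 3, 2, 1])[0] for _ in range(10_000)]
reports = [client_report(v, domain, epsilon=1.0) for v in truth]
print(server_estimate(reports, domain, epsilon=1.0))
```

Note how the client-side work is a single coin flip and a single categorical report, reflecting the resource constraints (low-power clients, minimal communication) called out in the outline above.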
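Both the Uber elastic-sensitivity tool and Aircloak, listed under tools above, follow an output-perturbation pattern: execute the aggregate query as usual and add noise to the result before releasing it. The snippet below is a minimal sketch of that pattern using the Laplace mechanism with a caller-supplied sensitivity; it does not reproduce either tool's actual API, query rewriting, or the elastic-sensitivity static analysis.

```python
import numpy as np

def private_count(rows, predicate, epsilon):
    """Output perturbation for a COUNT query: run the query, then add
    Laplace noise. COUNT has global sensitivity 1 (adding or removing one
    row changes the count by at most 1), so scale 1/epsilon gives epsilon-DP."""
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

def private_aggregate(true_answer, sensitivity, epsilon):
    """Generic Laplace mechanism: noise scale = sensitivity / epsilon.
    In the Uber tool, the sensitivity bound would come from a static
    analysis of the SQL query; here it is simply supplied by the caller."""
    return true_answer + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Toy usage: count salaries above a threshold with epsilon = 0.5.
salaries = [95_000, 120_000, 87_000, 140_000, 99_000]
print(private_count(salaries, lambda s: s > 100_000, epsilon=0.5))
```

For COUNT queries the noise scale is simply 1/ε; as described in the quoted text above, the elastic-sensitivity analysis generalizes this by statically deriving a sensitivity bound for more complex SQL queries, including joins, without modifying the database.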
Presenter Bios
The presenters have extensive experience in the field of privacy-preserving data mining, including the theory and application of privacy techniques. They have published several well-cited papers on privacy. For instance, Krishnaram Kenthapadi and Ilya Mironov co-authored the second paper on differential privacy (EUROCRYPT 2006), which introduced the notion of (\epsilon, \delta)-differential privacy. Abhradeep Guha Thakurta’s research on differential privacy and machine learning has appeared in reputed venues such as STOC, FOCS, NIPS, KDD, and ICML. They also have rich experience applying privacy techniques in practice (LinkedIn Salary and LinkedIn's PriPeARL framework for privacy-preserving analytics and reporting, Google’s RAPPOR, and Apple’s differential privacy deployment for iOS & macOS respectively), and have been providing technical leadership on AI and privacy at their respective organizations.
Krishnaram Kenthapadi is part of the AI team at LinkedIn, where he leads the fairness, transparency, explainability, and privacy modeling efforts across different LinkedIn applications. He also serves as LinkedIn's representative in Microsoft's AI and Ethics in Engineering and Research (AETHER) Committee. He shaped the technical roadmap and led the privacy/modeling efforts for the LinkedIn Salary product, and prior to that, served as the relevance lead for the LinkedIn Careers and Talent Solutions AI team, which powers search/recommendation products at the intersection of members, recruiters, and career opportunities. He also led the development of LinkedIn's PriPeARL system for privacy-preserving analytics. Previously, he was a Researcher at Microsoft Research Silicon Valley, where his work resulted in product impact (and Gold Star / Technology Transfer awards) and several publications/patents. He received his Ph.D. degree in Computer Science from Stanford University in 2006 for his thesis, "Models and Algorithms for Data Privacy". He serves regularly on the program committees of KDD, WWW, WSDM, and related conferences, and co-chaired the 2014 ACM Symposium on Computing for Development. He received the CIKM best case studies paper award, the SODA best student paper award, and a WWW best paper award nomination for his work on privacy-preserving publishing of web search logs. He has published 35+ papers with 2500+ citations, and has filed 130+ patents.
Ilya Mironov is a Staff Research Scientist at Google Brain, working on privacy in machine learning. Prior to joining Google, he was a Researcher at Microsoft Research Silicon Valley, where he co-authored seminal works on applications and implementations of differential privacy. He has been a keynote speaker at several recent events, such as the Private and Secure Machine Learning workshop at ICML 2017, the Theory and Practice of Differential Privacy workshop co-located with ACM CCS 2017, and NordSec 2017. He graduated with a Ph.D. in Computer Science from Stanford in 2003.
Abhradeep Guha Thakurta is an Assistant Professor in the Department of Computer Science at the University of California, Santa Cruz. His primary research interest is in the intersection of data privacy and machine learning. He focuses on demonstrating, both theoretically and in practice, that privacy enables designing better machine learning algorithms, and vice versa. Prior to joining academia, Abhradeep was a Senior Machine Learning Scientist at Apple, where he was the primary algorithm designer in deploying local differential privacy on iOS 10 and macOS Sierra. This project is possibly one of the largest industrial deployments of differential privacy and has resulted in 200+ press articles, a follow-up paper at NIPS 2017, and four granted patents. He won the Early Career Award from Pennsylvania State University in 2016 for the same project. Before Apple, Abhradeep worked as a post-doctoral researcher at Microsoft Research Silicon Valley and Stanford University, and then as a Research Scientist at Yahoo Research. He received his Ph.D. from Pennsylvania State University.