Privacy-Preserving Data Analysis
Spring 2018
Venue: Bakers 180
Time: Mondays, 12:40-2:30 PM
Course forum on Piazza: piazza.com/osu/spring2018/cse5479
Dissemination of large volumes of personal and sensitive data has become a prevalent practice in this day and age. Making such data widely available for statistical analysis and machine learning can provide a broad range of benefits. However, the results of analyses of private data can lead to devastating disclosures of sensitive information. We face two seemingly conflicting goals: gaining the benefits of machine learning based on private data, and protecting the privacy of the individuals whose data is collected. How can we achieve both? Straightforward approaches to the privacy problem, such as data anonymization, are at best unreliable: the last decade has seen a string of attacks that recover personal information from supposedly "anonymized" data.
The last decade has also witnessed the rise of a rigorous theory to deal with this challenge. This theory is centered around a meaningful and robust mathematical definition of privacy, known as Differential Privacy. A powerful algorithmic framework for differential privacy has been developed over the years, leading to numerous practical and efficient algorithms with strong, provable privacy guarantees for various applications in machine learning, data mining, and statistics. Owing to its attractive properties, differential privacy has become the gold standard of statistical data privacy, and it has recently made its way into industry with several high-profile adoptions, notably by Google and Apple.
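For concreteness, the standard definition can be stated in one line: a randomized algorithm M is ε-differentially private if for every pair of datasets D and D' that differ in the record of a single individual, and for every set S of possible outputs,

    \Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S].

The parameter ε bounds the privacy loss: the smaller it is, the less any single person's data can influence the output distribution.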
This class will start by demonstrating the need for a rigorous privacy framework via several examples that led to high-profile privacy breaches. We will then introduce differential privacy and discuss the semantics of its guarantees, how to design algorithms that satisfy it (known as differentially private algorithms), and its basic properties, such as closure under adaptive composition and post-processing. Next, we will cover popular and powerful algorithmic techniques from the differential privacy literature, focusing on some of the best-known constructions of differentially private algorithms for several important problems in machine learning and statistical data analysis. These include algorithms deployed in industry, such as Google's RAPPOR protocol and Apple's differentially private protocols for privately learning new words, frequent emoji patterns, and other statistics from iPhone users.
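To give a flavor of these deployed techniques, here is a minimal Python sketch of randomized response, the classical local-privacy primitive that protocols such as RAPPOR build on. The function names and parameters below are illustrative only and are not drawn from Google's or Apple's actual code.

    import math
    import random

    def randomized_response(bit, epsilon):
        """Report one private bit with epsilon-local differential privacy.

        With probability e^eps / (e^eps + 1) the true bit is reported;
        otherwise it is flipped. Any single report is deniable, yet an
        aggregator can still estimate population-level frequencies.
        """
        p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
        return bit if random.random() < p_truth else (not bit)

    def estimate_frequency(reports, epsilon):
        """Debias the noisy reports to estimate the true fraction of 1s."""
        p = math.exp(epsilon) / (math.exp(epsilon) + 1)
        observed = sum(reports) / len(reports)
        return (observed - (1 - p)) / (2 * p - 1)

For example, with ε = ln 3 each user answers truthfully with probability 3/4; no individual report reveals much, but averaging many debiased reports recovers the population statistic.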
The goal of this course is to introduce students to the burgeoning area of privacy-preserving data analysis, mainly differential privacy. This course aims to prepare students to take up a research career in data privacy, or to pursue industry positions in privacy engineering, for which there is increasing demand, especially at large corporations such as Google, Apple, Uber, and many others. At the end of this course, students are expected to have a solid understanding of the foundational concepts of private data analysis, and a good grasp of the design principles behind practical, useful algorithms that provide strong and provable privacy guarantees.
If you have not taken one of these classes, but feel confident you have the knowledge from other sources/courses, you can still enroll; please talk with me first.
(See possible topics and some relevant papers in the References section below.)
– Mid-term check (discussion with groups during office hours)
– Final Report: a) Introduction: literature review + motivation, b) Problem statement, c) Proposed solution/implementation and results.
– Project presentation: 20-25 min. per group in the last week of class.
Attacks on Privacy
Early papers: definitions, basic mechanisms, properties
More tools and algorithmic techniques
Differentially private machine learning
Local (Distributed) Model of Differential Privacy
Differential Privacy for Streaming
Lower Bounds (Limits of Differential Privacy)
Relaxations of Differential Privacy (non-worst-case definitions allowing for distributional assumptions)
Differential Privacy for Robust Adaptive Data Analysis