Privacy-Preserving Data Analysis
Spring 2018
Venue: Bakers 180
Time: Mondays, 12:40-2:30 PM
Course forum on Piazza: piazza.com/osu/spring2018/cse5479
Dissemination of large volumes of personal and sensitive data has become a prevalent practice in this day and age. Making such data widely available for statistical analysis and machine learning can provide a broad range of benefits. However, the results of analyses of private data can lead to devastating disclosures of sensitive information. We face two seemingly conflicting goals: gaining the benefits of machine learning based on private data, and protecting the privacy of the individuals whose data is collected. How can we achieve both? Straightforward approaches to the privacy problem, such as data anonymization, are at best unreliable: the last decade has seen a string of attacks that recover personal information from supposedly "anonymized" data.
The last decade has also witnessed the rise of a rigorous theory to deal with this challenge. This theory is centered around a meaningful and robust mathematical definition of privacy, known as Differential Privacy. A powerful algorithmic framework for differential privacy has been developed over the years, leading to numerous practical and efficient algorithms with strong, provable privacy guarantees for various applications in machine learning, data mining, and statistics. Owing to its attractive properties, differential privacy has become the gold standard of statistical data privacy, and it has recently made its way into industry with several high-profile adoptions, notably by Google and Apple.
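For concreteness, the standard definition can be stated in one line: a randomized algorithm M is ε-differentially private if for every pair of datasets D and D' that differ in the record of a single individual, and for every set S of possible outputs,

    \Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S].

The parameter ε bounds the privacy loss: the smaller it is, the less any single person's data can influence the output distribution.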
This class will start by demonstrating the need for a rigorous privacy framework via several examples that led to high-profile privacy breaches. We will then introduce differential privacy and discuss the semantics of its guarantees, how to design algorithms that satisfy it (known as differentially private algorithms), and its basic properties, such as closure under adaptive composition and post-processing. Next, we will cover popular and powerful algorithmic techniques from the differential privacy literature, focusing on some of the best-known constructions of differentially private algorithms for several important problems in machine learning and statistical data analysis. These include algorithms deployed in industry, such as Google's RAPPOR protocol and Apple's differentially private protocols for privately learning new words, frequent emoji patterns, and other statistics from iPhone users.
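To give a flavor of these deployed techniques, here is a minimal Python sketch of randomized response, the classical local-privacy primitive that protocols such as RAPPOR build on. The function names and parameters below are illustrative only and are not drawn from Google's or Apple's actual code.

    import math
    import random

    def randomized_response(bit, epsilon):
        """Report one private bit with epsilon-local differential privacy.

        With probability e^eps / (e^eps + 1) the true bit is reported;
        otherwise it is flipped. Any single report is deniable, yet an
        aggregator can still estimate population-level frequencies.
        """
        p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
        return bit if random.random() < p_truth else (not bit)

    def estimate_frequency(reports, epsilon):
        """Debias the noisy reports to estimate the true fraction of 1s."""
        p = math.exp(epsilon) / (math.exp(epsilon) + 1)
        observed = sum(reports) / len(reports)
        return (observed - (1 - p)) / (2 * p - 1)

For example, with ε = ln 3 each user answers truthfully with probability 3/4; no individual report reveals much, but averaging many debiased reports recovers the population statistic.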
The goal of this course is to introduce students to the burgeoning area of privacy-preserving data analysis, mainly differential privacy. This course aims to prepare students to take up a research career in data privacy, or to pursue industry positions in privacy engineering, for which there is increasing demand, especially at large corporations such as Google, Apple, Uber, and many others. At the end of this course, students are expected to have a solid understanding of the foundational concepts of private data analysis, and a good grasp of the design principles behind practical, useful algorithms that provide strong and provable privacy guarantees.
If you have not taken one of these classes, but feel confident you have the knowledge from other sources/courses, you can still enroll; please talk with me first.
(See possible topics and some relevant papers in the References section below.)
– Mid-term check (discussion with groups during office hours)
– Final Report: a) Introduction: literature review + motivation, b) Problem statement, c) Proposed solution/implementation and results.
– Project presentation: 20-25 min. per group in the last week of class.
Attacks on Privacy
Early papers: definitions, basic mechanisms, properties
More tools and algorithmic techniques
Differentially private machine learning
Local (Distributed) Model of Differential Privacy
Differential Privacy for Streaming
Lower Bounds (Limits of Differential Privacy)
Relaxations of Differential Privacy (non-worst-case definitions allowing for distributional assumptions)
Differential Privacy for Robust Adaptive Data Analysis