PSML 2017 - ABSTRACTS

Invited talks:

Peeter Laud. Privacy-preserving machine learning and data mining using the Sharemind platform

We have built the Sharemind platform for privacy-preserving computations, based on additively sharing the secrets among three parties and a large, passively secure protocol set built on top of this representation of private data. Using this protocol set, we have built various privacy-preserving numerical, statistical, anomaly detection, and combinatorial applications and prototypes on top of Sharemind, some of them for use-cases with a large number of inputs and correspondingly large computation size.

In this talk, I explain the basics of Sharemind and the construction of applications which may be of interest to the ML community. This covers the linear regression and principal component analysis, genetic algorithms, frequent itemset mining. I also explain how differentially private computations have been done on top of Sharemind: we implemented the sample-and-aggregate mechanism by Nissim et al. and Smith, as well as a technique to keep track of personalized differential privacy budgets by Ebadi et al. There are important differences between normal and privacy-preserving applications, when it comes to the relative efficiency of certain algorithmic steps. These have affected the construction of the applications that I'm going to talk about.

Ilya Mironov. Deep Learning with Differential Privacy: Two Approaches

We discuss two recently proposed approaches towards offering differential privacy for training data. The first approach modifies the SGD procedure so that the updates to the model's weights are provably differentially private. The second approach, called Private Aggregation of Teacher Ensembles (PATE) is particularly suitable for training classifiers. PATE combines, in a black-box fashion, multiple models trained with disjoint datasets and aggregates their results---enforcing differential privacy---to train a "student" model who is never directly exposed to sensitive data.

Contributed talks:

Bryan Cai, Constantinos Daskalakis and Gautam Kamath. Priv'IT: Private and Sample Efficient Identity Testing

T1.pdf

Garrett Bernstein, Ryan McKenna, Tao Sun, Daniel Sheldon, Michael Hay and Gerome Miklau. Differentially Private Learning of Undirected Graphical Models using CGMs

We investigate the problem of learning discrete, undirected graphical models in a differentially private way. Approaches to this problem range from privileged algorithms that conduct learning completely behind the privacy barrier to schemes that release private summary statistics paired with algorithms to learn parameters from those statistics. We show that the approach of releasing noisy sufficient statistics using the Laplace mechanism achieves a good trade-off between privacy, utility, and practicality. A naive learning algorithm that uses the noisy sufficient statistics ``as is'' outperforms general-purpose differentially private learning algorithms. However, it ignores knowledge about the data generating process, is on uncertain theoretical foundations, and exhibits certain pathologies. We develop a more principled approach that applies the formalism of collective graphical models to perform inference over the true sufficient statistics within an expectation-maximization framework. We show that this learns better models than competing approaches on both synthetic data and on real human mobility data used as a case study.

Marko Mitrovic, Amin Karbasi, Andreas Krause and Mark Bun. Differentially Private Submodular Maximization: Data Summarization in Disguise

How can we extract representative features from a dataset containing sensitive personal information, while providing individual-level privacy guarantees? Many data summarization applications are captured by the general framework of submodular maximization. As a consequence, a wide range of efficient approximation algorithms for submodular maximization have been developed. However, when such applications involve sensitive data about individuals, their privacy concerns are not automatically addressed by these algorithms.

To remedy this problem, we propose a general and systematic study of differentially private submodular maximization. We present privacy-preserving algorithms for both monotone and non-monotone submodular maximization under cardinality, matroid, and p-extendible system constraints, with guarantees that are competitive with optimal solutions. Along the way, we analyze a new algorithm for non-monotone submodular maximization under a cardinality constraint, which is the first (even non-privately) to achieve a constant approximation ratio with a linear number of function evaluations. We additionally provide two concrete experiments to validate the efficacy of these algorithms. In the first experiment, we privately solve the facility location problem using a dataset of Uber pickup locations in Manhattan. In the second experiment, we perform private submodular maximization of a mutual information measure to select features relevant to classifying patients by diabetes status.

Aleksandra Korolova. The Hybrid Model for Privacy and its Benefits

T4.pdf

Nathalie Baracaldo, Bryant Chen and Heiko Ludwig. Detecting Causative Attacks using Data Provenance

T5.pdf

Mentari Djatmiko, Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Maximilian Ott, Giorgio Patrini, Guillaume Smith, Brian Thorne and Dongyao Wu. Privacy-preserving entity resolution and logistic regression on encrypted data

T6.pdf

Posters:

Martine De Cock, Rafael Dowsley, Nick McKinney, Anderson Nascimento and Dongrui Wu. Privacy Preserving Machine Learning with EEG Data

P1.pdf

Takuya Hayashi, Shohei Kuri, Toshiaki Omori, Seiichi Ozawa, Yoshinori Aono, Le Trieu Phong, Lihua Wang and Shiho Moriai. Privacy-preserving machine learning via additively homomorphic encryption: the case of linear and logistic regressions

In the big data era, the amount of data has been exploded and it becomes hard to do machine learning with such gigantic data using relatively small computational resource. In such case the use of cloud computing platforms is beneficial to obtain large computational resource easily.

When cloud computing is employed one needs to put one's own data into a cloud server, therefore privacy risk would arise when the data includes sensitive information. To prevent such privacy issues, in this work, we develop a secure outsourcing framework for machine learning with three participants model: data provider, outsourced server, and data analyst, and data leakage to the outsourced server is regarded as a threat, i.e., an honest-but-curious server is considered. We utilize an additive homomorphic encryption scheme to achieve the security requirement. Since the homomorphic encryption scheme only supports addition, it is not easy to compute machine learning algorithm straightforwardly. To circumvent the problem, we introduce pre-processing for the data provider and post-processing (including optimization of cost function) for the data analyst.

Our framework is as follows:

firstly modify and approximate the cost function as a linear function of parameter variables, then each data provider pre-processes to compute some products of data values and encrypts them to send the outsourced server. After that, the outsourced server computes summations of encrypted data by homomorphic additions and sends the summations to the data analyst. The data analyst decrypts the summations and computes the individual cost function to be optimized. This framework is efficient and scalable to the amount of data; the data provider and data analyst cost $O(poly(d))$ for the number of attributes $d$, although the outsourced server costs $O(poly(d)N)$ for the number of data records $N$. Therefore the data provider and data analyst are not much affected even when $N$ is gigantically large.

We construct secure linear regression and (multi-class) logistic regression as instantiations of our framework and evaluate with some datasets from the UCI dataset. On the exactness of our outsourcing framework, it is easily proved that our secure linear regression provides the same solution with the original linear regression over plain data. On the other hand, since our logistic regression approach employs approximation of sigmoid/softmax functions, a solution provided by our approach is not identical. However it is not much far from a solution for the original problem, for example, our secure logistic regression shows 81.9\% accuracy (originally 85.9\%) with the Adult Income dataset for binary-class, and 93.5\% accuracy (originally 95.3\%) with the Digit dataset for multi-class. For efficiency, our regressions are very scalable to the amount of data; our simulation shows that our secure linear regression with $10^8$ data records with 20 attributes requires less than 30 mins for the outsourced server. It is worth noting that our secure linear/logistic regressions can be combined with the functional mechanism to add differential privacy.

This work is partially supported by JST CREST number JPMJCR168A.

Adria Gascon, Phillipp Schoppmann, Borja Balle, Mariana Raykova, Jack Doerner, Samee Zahur and David Evans. Privacy-Preserving Distributed Linear Regression on High-Dimensional Data

P3.pdf

Martine De Cock, Rafael Dowsley, Caleb Horst, Raj Katti, Anderson Nascimento, Wing-Sea Poon and Stacey Truex. Efficient Privacy-Preserving Scoring of Decision Trees, SVM and Logistic Regression Models

P4.pdf

Le Trieu Phong, Yoshinori Aono, Takuya Hayashi, Lihua Wang and Shiho Moriai. Privacy-Preserving Deep Learning: Revisited and Enhanced

We build a privacy-preserving deep learning system in which many learning participants perform neural network-based deep learning over a combined dataset of all, without actually revealing the participants' local data to an honest-but-curious server. To that end, we revisit the previous work by Shokri and Shmatikov (ACM CCS 2015) and point out that local data information may be actually leaked to the curious server. We then move on to fix that problem via building an enhanced system with following properties: (1) no information is leaked to the server; and (2) accuracy is kept intact, compared to that of the ordinary deep learning system also over the combined dataset.

Conceptually, our system is a bridge between deep learning and cryptography: we utilise stochastic gradient descent (SGD) applied to neural networks, in combination with additively homomorphic encryption. We show that our usage of encryption adds little overhead in communication to the ordinary deep learning system.

Concretely, the increased communication costs are not big: less than 3 times compared with original costs when learning over plain data, for concrete datasets MNIST and SVHN. For example, in the case of MNIST, if each learning participant needs to communicate 0.56 MB of plain gradients to the server at each upload or download; then in our system with lattice-based (specifically LWE-based) encryption, the corresponding communication cost at each upload or download becomes 1.38 MB which needs about 0.11 seconds to be transmitted over a 100 Mbps channel. On the computational side, the most frequent operation is ciphertext additions over the server. As each ciphertext is a vector of integers, the addition cost is also small. For example, it takes only 0.013 seconds to add two LWE-based ciphertexts with parameters for the SVHN dataset over a server (2.60GHz) with one thread of computation.

The paper related to this abstract has been accepted to the 8-th International Conference on Applications and Technologies in Information Security (ATIS 2017), held at Massey University on 6-7 July 2017. This work is partially supported by JST CREST #JPMJCR168A.

Adria Gascon, Phillipp Schoppmann and Borja Balle. Private Multi-Party Document Classification

P6.pdf

Joonas Jälkö, Onur Dikmen and Antti Honkela. Differentially Private Variational Inference for Non-conjugate Models

P7.pdf

Maria-Florina Balcan, Travis Dick and Ellen Vitercik. Differentially private algorithm configuration

P11.pdf

Mikko Heikkilä, Eemil Lagerspetz, Samuel Kaski, Kana Shimizu, Sasu Tarkoma and Antti Honkela. Differentially Private Bayesian Learning on Distributed Data

In a distributed setting where each party only holds a single sample or a few samples of data, by using secure multiparty computation (SMC) approach in combination with differential privacy (DP) it is possible to guarantee privacy, while introducing only the minimum amount of noise necessary for DP. Considered generally, Bayesian learning provides a natural complement for DP because it can inherently handle uncertainty, including the uncertainty introduced to ensure DP, and it provides a flexible framework for data modelling (Williams et al 2010).

We propose a general strategy for learning in the distributed setting (Heikkilä et al 2017). Our approach builds on the asymptotically optimal (Foulds et al 2016, Honkela et al 2016) and practically efficient DP Bayesian inference with rapidly diminishing extra cost. The method is based on sufficient statistics perturbation, the calculation of which consists of a summation over the individual sample contributions. The same method can be applied iteratively e.g. to perform variational inference (Jälkö et al 2016) in the distributed setting.

To calculate the necessary sum, we introduce a SMC algorithm called Distributed Compute Algorithm (DCA) based on a form of secret sharing. The scheme relies on several independent aggregators, called Compute nodes. At a general level, the data holders divide their data and some blinding noise into shares that are each sent to one Compute. After receiving all shares, each Compute decrypts the values and sums them. The final results can be obtained by summing up the values from all Computes, which cancels the blinding noise. With DCA, cryptography is only required to secure the communication between the data holder and the Compute nodes. Since this does not need to be homomorphic as in many other protocols, faster symmetric cryptography can be used for the bulk of the data. For DP, we use the Gaussian mechanism. The idea is that each individual client adds a small amount of Gaussian noise to his data, resulting in the aggregated noise to be another Gaussian with large enough variance. This approach can also be leveraged to guarantee privacy against colluding or faulting data holders.

We test our approach on simulated and real data, using linear regression as an example. As an alternative to the standard Bayesian linear regression model, we consider a more efficient version that uses a non-linear projection of the data to improve prediction performance in the DP context (Honkela et al 2016). We also extend the asymptotic optimality proofs for Laplace mechanism presented by Honkela et al. (2016) to cover the Gaussian mechanism as well.

References

James Foulds, Joseph Geumlek, Max Welling, and Kamalika Chaudhuri. On the theory and practice of privacy-preserving Bayesian data analysis. In Proceedings of the Thirty- Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pages 192–201, Ar- lington, Virginia, United States, March 2016. AUAI Press. ISBN 978-0-9966431-1-5. URL http://dl.acm.org/citation.cfm?id=3020948.3020969. arXiv:1603.07294 [cs.LG].

Mikko Heikkilä, Eemil Lagerspetz, Samuel Kaski, Kana Shimizu, Sasu Tarkoma, and Antti Honkela. Differentially private Bayesian learning on distributed data. arXiv preprint arXiv:1703.01106, 2017.

Antti Honkela, Mrinal Das, Onur Dikmen, and Samuel Kaski. Efficient differentially private learning improves drug sensitivity prediction. arXiv:1606.02109, 2016.

Joonas Jälkö, Onur Dikmen, and Antti Honkela. Differentially private variational inference for non-conjugate models. arXiv:1610.08749, 2016.

O. Williams and F. McSherry. Probabilistic inference and differential privacy. In Adv. Neural Inf. Process. Syst. 23, 2010.

Yu-Xiang Wang. Per-instance Differential Privacy and the Adaptivity of Posterior Sampling in Linear and Ridge regression

Differential privacy (DP), ever since its advent, has been a controversial object. On the one hand, it provides strong provable protection of individuals in a data set; on the other hand, it has been heavily criticized for being not practical, partially due to its complete independence to the actual data set it tries to protect. In this paper, we address this issue by a new and more fine-grained notion of differential privacy --- per instance differential privacy (pDP), which captures the privacy of a specific individual with respect to a fixed data set. We show that this is a strict generalization of the standard DP and inherits all its desirable properties, e.g., composition, invariance to side information and closedness to postprocessing, except that they all hold for every instance separately. When the data is drawn from a distribution, we show that per-instance DP implies generalization. Moreover, we provide explicit calculations of the per-instance DP for the output perturbation on a class of smooth learning problems. The result reveals an interesting and intuitive fact that an individual has stronger privacy if he/she has small ``leverage score'' with respect to the data set and if he/she can be predicted more accurately using the leave-one-out data set. Using the developed techniques, we provide a novel analysis of the One-Posterior-Sample (OPS) estimator and show that it achieves $(\epsilon,\delta)$-DP for any constant $\epsilon$ and match the exact lower bound up to a $1+O(n^{-1/3})$ multiplicative factor for any data and when the data is well-conditioned it achieves $(O(n^{-1/3}),\delta)$-pDP for any target individual. Simulation shows several orders-of-magnitude more favorable privacy and utility trade-off when we consider the privacy of only the users in the data set.