Preserving Privacy and Utility in Text Data - Tom Diethe
Amazon prides itself on being the most customer-centric company on Earth. That means maintaining the highest possible standards of both security and privacy when dealing with customer data. This month, at the ACM Web Search and Data Mining (WSDM) Conference, my colleagues and I will describe a way to protect privacy during large-scale analyses of textual data supplied by customers. Our method works by, essentially, re-phrasing the customer-supplied text and basing analysis on the new phrasing rather than on the customers’ own language.
Perfectly Privacy-Preserving AI: What is it and how do we achieve it? - Patricia Thaine
Many AI applications need to process huge amounts of sensitive information for model training, evaluation, and real-world integration. These tasks include facial recognition, speaker recognition, text processing, and genomic data analysis. Unfortunately, one of two scenarios occurs when training models to perform these tasks: either models end up being trained on sensitive user information, making them vulnerable to malicious actors, or their evaluations are not representative of their abilities because the scope of the test set is limited. In some cases, the models never get created in the first place. There are a number of approaches that can be integrated into AI algorithms in order to maintain various levels of privacy: differential privacy, secure multi-party computation, homomorphic encryption, federated learning, secure enclaves, and automatic data de-identification. We will briefly explain each of these methods and describe the scenarios in which they would be most appropriate. Recently, several of these methods have been applied to machine learning models. We will cover some of the most interesting examples of privacy-preserving ML, including the integration of differential privacy with neural networks to prevent unwanted inferences being made about a network’s training data. Finally, we will discuss how the privacy-preserving machine learning approaches that have been proposed so far would need to be combined in order to achieve perfectly privacy-preserving machine learning.
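To make the "differential privacy with neural networks" point above concrete, here is a minimal sketch of the core DP-SGD idea: clip each example's gradient so that no single record can dominate an update, then add Gaussian noise calibrated to that clip norm. The function name and the parameters `clip_norm` and `noise_multiplier` are illustrative assumptions, not taken from any of the systems discussed in the abstract.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One differentially private gradient step (DP-SGD style sketch).

    per_example_grads: array of shape (batch_size, num_params) holding the
    gradient of the loss for each example separately.
    """
    rng = rng or np.random.default_rng()

    # 1. Clip each example's gradient to bound its influence (sensitivity).
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / (norms + 1e-12))
    clipped = per_example_grads * scale

    # 2. Sum the clipped gradients and add Gaussian noise scaled to the clip norm.
    summed = clipped.sum(axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)

    # 3. Average to obtain the noisy gradient used for the parameter update.
    return (summed + noise) / per_example_grads.shape[0]

# Toy usage: a batch of 32 examples, a model with 10 parameters.
grads = np.random.randn(32, 10)
noisy_grad = dp_sgd_step(grads)
```

The privacy guarantee (the effective ε, δ) then follows from how much noise is added relative to the clip norm and how many such steps are taken; accounting for that composition is the job of the DP analysis, not shown here.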
Calibrating Mechanisms for Privacy Preserving Text Analysis - Oluwaseyi Feyisetan
This talk presents a formal approach to carrying out privacy-preserving text perturbation using a variant of Differential Privacy (DP) known as Metric DP (mDP). Our approach applies carefully calibrated noise to vector representations of words in a high-dimensional space, as defined by word embedding models. We present a privacy proof that satisfies mDP, where the privacy parameter ε provides guarantees with respect to a distance metric defined by the word embedding space. We demonstrate how ε can be selected by analyzing plausible-deniability statistics, backed up by large-scale analysis on GloVe and fastText embeddings. We also conduct experiments on well-known datasets to demonstrate the tradeoff between privacy and utility for varying values of ε on different task types. Our results provide insights into carrying out practical privatization of text-based applications for a broad range of tasks.
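The following is a minimal sketch of the mechanism the abstract describes: add ε-calibrated noise to a word's embedding and replace the word with the vocabulary item nearest to the noisy vector. The tiny embedding table is a stand-in for the GloVe or fastText vectors mentioned above, and the noise is drawn in one common way for densities proportional to exp(-ε‖z‖): a uniformly random direction scaled by a Gamma-distributed magnitude. Treat this as an illustration of the idea, not the paper's exact mechanism.

```python
import numpy as np

rng = np.random.default_rng()

# Toy embedding table; in practice these would be GloVe or fastText vectors.
emb = {
    "movie":  np.array([0.9, 0.1, 0.0]),
    "film":   np.array([0.8, 0.2, 0.1]),
    "doctor": np.array([0.1, 0.9, 0.3]),
    "nurse":  np.array([0.2, 0.8, 0.4]),
}
words = list(emb)
vectors = np.stack([emb[w] for w in words])
dim = vectors.shape[1]

def perturb_word(word, epsilon):
    """Replace `word` with the nearest neighbour of its noised embedding."""
    # Sample noise with density proportional to exp(-epsilon * ||z||):
    # a uniform direction scaled by a Gamma(dim, 1/epsilon) magnitude.
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=dim, scale=1.0 / epsilon)
    noisy = emb[word] + magnitude * direction

    # Project back onto the vocabulary via the nearest Euclidean neighbour.
    distances = np.linalg.norm(vectors - noisy, axis=1)
    return words[int(np.argmin(distances))]

# Smaller epsilon -> more noise -> more frequent substitutions.
print([perturb_word("movie", epsilon=5.0) for _ in range(5)])
```

The privacy/utility tradeoff the abstract studies corresponds to this ε: large values mostly return the original word (high utility, weak privacy), while small values scatter it across the vocabulary.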
Privacy-Aware Personalized Entity Representations for Improved User Understanding - Levi Melnick
Representation learning has transformed the field of machine learning. Advances like ImageNet, word2vec, and BERT demonstrate the power of pre-trained representations to accelerate model training. The effectiveness of these techniques derives from their ability to represent words, sentences, and images in context. Other entity types, such as people and topics, are crucial sources of context in enterprise use cases, including organization, recommendation, and discovery of vast streams of information. However, learning representations for these entities from private data aggregated across user shards carries the risk of privacy breaches. Personalizing representations by conditioning them on a single user’s content eliminates these privacy risks while providing a rich source of context that can change the interpretation of words, people, documents, groups, and other entities commonly encountered in workplace data. In this paper, we explore methods that embed user-conditioned representations of people, key phrases, and emails into a shared vector space based on an individual user’s emails. We evaluate these representations on a suite of representative communication inference tasks using both a public email repository and live user data from an enterprise. We demonstrate that our privacy-preserving lightweight unsupervised representations rival supervised approaches. When used to augment supervised approaches, these representations are competitive with deep-learned multi-task models based on pre-trained representations.
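As a rough illustration of what a "lightweight unsupervised" user-conditioned representation could look like, the sketch below builds a co-occurrence matrix over the contacts and key phrases found in a single user's emails and factorizes it to place all entities in one shared vector space. The data, entity types, and SVD-based factorization are assumptions made for the example; they are not the method or configuration used in the paper.

```python
import numpy as np
from itertools import combinations

# Illustrative data: each "email" from one user's mailbox reduced to the
# entities it mentions (contacts and key phrases).
emails = [
    {"alice@corp.com", "bob@corp.com", "quarterly report"},
    {"alice@corp.com", "quarterly report", "budget review"},
    {"bob@corp.com", "offsite planning"},
]

entities = sorted({e for mail in emails for e in mail})
index = {e: i for i, e in enumerate(entities)}

# Co-occurrence counts: two entities co-occur if they appear in the same email.
cooc = np.zeros((len(entities), len(entities)))
for mail in emails:
    for a, b in combinations(sorted(mail), 2):
        cooc[index[a], index[b]] += 1
        cooc[index[b], index[a]] += 1

# Truncated SVD yields low-dimensional vectors in a shared space.
u, s, _ = np.linalg.svd(cooc)
k = 2
embeddings = u[:, :k] * s[:k]

for ent, vec in zip(entities, embeddings):
    print(ent, np.round(vec, 2))
```

Because everything is computed from one user's own mailbox, no representation ever depends on data aggregated across user shards, which is the privacy property the abstract emphasizes.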
Classification of Encrypted Word Embeddings using Recurrent Neural Networks - Robert Podschwadt
Deep learning has made many exciting applications possible, and given the popularity of social networks and user-generated content, there is no shortage of data for these applications. The content generated by users is written or spoken in natural language, which needs to be processed by computers. Recurrent Neural Networks (RNNs) are a popular choice for language processing due to their ability to process sequential data. On the other hand, this data is some of the most privacy-sensitive information. Therefore, privacy-preserving methods for natural language processing are crucial. In this paper, we focus on settings where a client has private data and wants to use machine learning as a service (MLaaS) to perform classification on the data without disclosing it to the entity offering the service. We employ homomorphic encryption techniques to achieve this. Homomorphic encryption allows data to be processed without being decrypted, thereby protecting the user’s privacy. Although homomorphic encryption has been used for privacy-preserving machine learning, most of the work has focused on image processing and convolutional neural networks (CNNs); RNNs have not been studied. In this work, we use homomorphic encryption to build privacy-preserving RNNs for natural language processing tasks. We show that RNNs can be run over encrypted data without loss of accuracy compared to a plaintext implementation by evaluating our system on a sentiment classification task on the IMDb movie review dataset.
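Running a full RNN over encrypted embeddings is beyond a short example, but the property the abstract relies on ("data being processed without it being decrypted") can be illustrated with a tiny from-scratch Paillier scheme, which is additively homomorphic: a server can compute a weighted sum of encrypted values, the building block of a linear or recurrent layer, without ever seeing the plaintext. This is a conceptual sketch with toy key sizes, not the encryption scheme or system from the paper (practical encrypted inference typically uses schemes such as CKKS).

```python
import math
import secrets

# Tiny Paillier keypair for illustration only; real keys use ~2048-bit primes.
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)          # valid because g = n + 1

def encrypt(m):
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Homomorphic operations the server can perform on ciphertexts alone:
def add(c1, c2):              # Enc(m1) * Enc(m2) decrypts to m1 + m2
    return (c1 * c2) % n2

def scalar_mul(c, k):         # Enc(m) ** k decrypts to k * m
    return pow(c, k, n2)

# The client encrypts its private features; the server computes a weighted sum
# (the core of a linear/recurrent layer) without decrypting anything.
features = [3, 5, 7]
weights = [2, 4, 1]
enc_features = [encrypt(x) for x in features]

enc_result = encrypt(0)
for c, w in zip(enc_features, weights):
    enc_result = add(enc_result, scalar_mul(c, w))

print(decrypt(enc_result))    # 3*2 + 5*4 + 7*1 = 33
```

Handling the non-linear activations inside an RNN under encryption is the hard part, and is exactly the kind of problem the paper addresses; this sketch only shows why linear operations come essentially for free.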
A User-Centric and Sentiment Aware Privacy-Disclosure Detection Framework based on Multi-input Neural Network - A K M Nuhil Mehdy
Data and information privacy is a major concern in today’s world. More specifically, users’ digital privacy has become one of the most important issues to deal with as information-sharing technology advances. An increasing number of users share information through text messages, emails, and social media without proper awareness of privacy threats and their consequences. One approach to preventing the disclosure of private information is to identify it in a conversation and warn the sender before the message is conveyed to the receiver. Another way of preventing the loss of sensitive information is to analyze and sanitize a batch of offline documents once the data has already been accumulated somewhere. However, automating the process of identifying user-centric privacy disclosure in textual data is challenging, because natural language has an extremely rich form and structure with many levels of ambiguity. We therefore investigate a framework that brings this challenge within reach by precisely recognizing users’ privacy disclosures in a piece of text, taking into account the authorship and sentiment (tone) of the content alongside linguistic features and techniques; a sketch of such a multi-input model follows below. The proposed framework is intended as a supporting plugin that helps text classification systems more accurately identify text that might disclose the author’s personal or private information.
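The sketch below shows one plausible shape for the multi-input architecture the abstract implies: a text branch over token sequences fused with a small auxiliary branch carrying sentiment and authorship features, ending in a binary disclosure prediction. The framework, layer sizes, feature dimensions, and the use of Keras are all assumptions for illustration, not the paper's actual configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10_000   # illustrative sizes, not the paper's configuration
MAX_LEN = 128
AUX_DIM = 4           # e.g. a sentiment score plus simple authorship features

# Text branch: token ids -> embeddings -> BiLSTM summary vector.
text_in = layers.Input(shape=(MAX_LEN,), name="tokens")
x = layers.Embedding(VOCAB_SIZE, 64, mask_zero=True)(text_in)
x = layers.Bidirectional(layers.LSTM(32))(x)

# Auxiliary branch: sentiment/authorship features as a dense vector.
aux_in = layers.Input(shape=(AUX_DIM,), name="aux_features")
a = layers.Dense(16, activation="relu")(aux_in)

# Fuse both inputs and predict whether the text discloses private information.
h = layers.Concatenate()([x, a])
h = layers.Dense(32, activation="relu")(h)
out = layers.Dense(1, activation="sigmoid", name="disclosure")(h)

model = Model(inputs=[text_in, aux_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

The point of the two-branch design is that the same sentence can be innocuous or a disclosure depending on who wrote it and in what tone, which is exactly the contextual signal the auxiliary inputs are meant to carry.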
Is It Possible to Preserve Privacy in the Age of AI? - Vijayanta Jain
Artificial Intelligence (AI) promises a positive paradigm shift in technology by providing new features and personalized experiences in our digital and physical worlds. In the future, almost all of our digital services and physical devices will be enhanced by AI to provide us with better features. However, because training artificially intelligent models requires large amounts of data, AI poses a threat to user privacy: its increasing prevalence promotes data collection. To address these concerns, some research efforts have been directed toward developing techniques that train AI systems while preserving privacy and that help users protect their own privacy. In this paper, we survey the literature and identify the privacy-preserving approaches that can be employed to these ends. We also suggest some future directions based on our analysis. We find that privacy-preserving research, specifically for AI, is in its early stages and requires more effort to address the current challenges and research gaps.