Privacy is important to each one of us. Whether we share information about ourselves with the government or with commercial entities like social media websites, it is in our interest to keep that information protected and used only with our consent.
With the advent of data protection regulations like the GDPR in the European Union and the Personal Data Protection Bill, 2018 in India, there is increasing regulatory backing for our privacy and for the protection of our personal data.
On the other end of the spectrum, protecting the personal data of customers is a huge challenge for companies. Even identifying all the personal data held by a company is non-trivial. Identifying personal data entities in customer data, protecting and anonymizing that data, and serving customer requests about how their data is used are all part of a company’s data analytics and regulatory compliance systems.
At the government level, the problem of protecting sensitive personal information assumes gargantuan proportions. Governments collect information about citizens for a number of reasons, such as welfare, identification, and security. All of this information could be linked to a unique identification number like the Aadhaar number, which further increases the data protection requirements.
A Personal Data Entity (PDE) is any entity that can help identify or profile a person, and hence needs to be protected and used only with the consent of the people concerned. Examples of Personal Data Entities include names, addresses, phone numbers, credit card numbers, and government identifiers such as the Aadhaar number.
Each of these PDEs can be assigned one or more fine-grained types called PDE Types (PDETs).
Because of recent advances in Deep Learning, there have been a few attempts to detect personal data entities in unstructured text using neural models. However, neural models require large amounts of training data to make good predictions, and for privacy reasons we cannot train models on the personal data of real people. Hence any such training dataset has to be artificially generated.
It may not be possible to manually annotate datasets large enough to train neural models. Hence we have to come up with ways to programmatically annotate personal data entities in unstructured text.
A team at IBM Research has described one dataset generation method in the following research paper.
Riddhiman Dasgupta, Balaji Ganesan, Aswin Kannan, Berthold Reinwald, and Arun Kumar. "Fine Grained Classification of Personal Data Entities." arXiv preprint arXiv:1811.09368 (2018). https://arxiv.org/abs/1811.09368
However, there continues to be a need for richer and more diverse datasets that can advance research in identifying personal data entities, which in turn will improve privacy and the protection of the personal details that people share with governments and private companies.
IIM Ahmedabad and IBM are excited to bring this dataset generation and coding challenge to students, budding data scientists, industry experts and fellow academicians. The challenge is to generate datasets with fictional but realistic personal data to advance AI research for identifying personal data entities in documents.
Generating training data for machine learning models requires intuition, an “Aha!” idea that saves a lot of manual effort and leads to much better neural model performance. Hence we’re posing this problem to you! We’re seeking fresh, out-of-the-box ideas to accomplish this task. We’ll, however, provide you with resources and some methods that we’re familiar with, to get you started.
We’ll be providing a corpus of English texts drawn from customer complaints to financial companies. The personal data entities in these texts have already been removed and replaced with placeholders like xxxx. As part of this hackathon, you have to impute (create new) values for the personal data entities that have been redacted from the texts.
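As a starting point, the redacted spans can be located with a regular expression, and the entity type can often be guessed from the surrounding words. The minimal Python sketch below assumes the placeholders are runs of the letter “x”; the keyword table and type labels are purely illustrative, not an official PDE type set.

```python
import re

# Assumption: redacted spans appear as runs of the letter 'x' (e.g. "xxxx"
# or "xxxx xxxx"), possibly separated by spaces, slashes, or dashes.
REDACTION = re.compile(r"\b[xX]{2,}(?:[\s/-]+[xX]{2,})*\b")

# Hypothetical keyword-to-type hints; the labels are illustrative only.
TYPE_HINTS = {
    "credit card": "CREDIT_CARD_NUMBER",
    "phone": "PHONE_NUMBER",
    "account": "ACCOUNT_NUMBER",
    "name is": "PERSON_NAME",
}

def guess_type(text, match, window=40):
    """Guess a PDE type from the words immediately preceding a redacted span."""
    context = text[max(0, match.start() - window):match.start()].lower()
    for keyword, pde_type in TYPE_HINTS.items():
        if keyword in context:
            return pde_type
    return "UNKNOWN"

text = "My credit card number is xxxx and I wish to raise a complaint."
for m in REDACTION.finditer(text):
    print(repr(m.group()), "->", guess_type(text, m))
# 'xxxx' -> CREDIT_CARD_NUMBER
```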
While the intellectual curiosity to solve a problem is likely to be your main motivation for participating in this hackathon, we hope this exercise will also be of use to the academic community, government, and industry in India. The datasets produced during this hackathon can be made available to researchers, to build better models that improve privacy and regulatory compliance. Depending on the number of submissions, we may be able to produce a combined large dataset with more diversity of personal data entities than any single team could achieve on its own.
Since dataset generation hackathons are relatively new, we ask for your patience if you encounter bugs or issues with the datasets or tools. Please report any issues to hackathon.iima@gmail.com and we will try to address them.
A method or a system to automatically impute values for the redacted portions in a text, which are known to have contained Personal Data Entities (PDEs).
“My credit card number is xxxx and I wish to raise a complaint ….”
In the above text, the entity masked with xxxx is the redacted portion. We might be able to guess that a 16-digit credit card number was originally present in this text.
The simplest output we are looking for is a rewritten text. In this example, the redacted portion should be replaced with some variant of a 16-digit number.
“My credit card number is 1234-5678-9012-3456 and I wish to raise a complaint ….”
However, a better output would be a credit card number that is not completely random, but obeys the Luhn (mod-10) checksum algorithm, as real card numbers do.
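For illustration, here is a minimal Python sketch that generates a random 16-digit number whose final check digit is chosen so that the whole number passes the Luhn check; the function names are our own.

```python
import random

def luhn_checksum(digits):
    """Luhn (mod-10) checksum of a list of digits."""
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9       # same as summing the two digits of the product
        total += d
    return total % 10

def generate_card_number(length=16):
    """Generate a random number of the given length that passes the Luhn check."""
    payload = [random.randint(0, 9) for _ in range(length - 1)]
    # Pick the check digit that makes the full number's checksum zero.
    check_digit = (10 - luhn_checksum(payload + [0])) % 10
    return "".join(map(str, payload + [check_digit]))

number = generate_card_number()
print("-".join(number[i:i + 4] for i in range(0, 16, 4)))
# e.g. 4539-1488-0343-6467 (Luhn-valid)
```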
For this hackathon, we’ll use popular datasets provided by Dataquest, namely the Credit Card Complaints dataset. We’ve created a smaller version of it by combining the unstructured text columns in the two datasets made available by Dataquest. The dataset is available at https://github.com/hackathoniima/ICADABAI2019.
You should submit a single JSON file containing an array of JSON elements; one possible element format is sketched below.
You can refer to the sample code in our GitHub repository (https://github.com/hackathoniima/ICADABAI2019) to read the input file and write to an output file. This will help you jump-start your solution.
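For illustration only, here is a minimal sketch of such a read/impute/write loop. The file names and field names (“id”, “text”, “entities”, “types”) are assumptions on our part; please follow the exact schema used in the repository’s sample code.

```python
import json

def impute(text):
    """Stand-in for your imputation method: returns the rewritten text plus
    the imputed entity values and their PDE types."""
    value = "4539-1488-0343-6467"          # e.g. a Luhn-valid card number
    return text.replace("xxxx", value, 1), [value], ["CREDIT_CARD_NUMBER"]

# Assumption: the input is a JSON array of records with a "text" field.
with open("input.json", encoding="utf-8") as f:
    records = json.load(f)

output = []
for i, record in enumerate(records):
    new_text, entities, types = impute(record["text"])
    output.append({"id": i, "text": new_text,
                   "entities": entities, "types": types})

# Write the submission file as a single JSON array.
with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(output, f, indent=2)
```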
Notes:
Training and Validation phase:
To encourage maximum participation, we're providing the full corpus of 26,636 English texts at the outset; there will be no separate test corpus. These texts are selected from the Dataquest corpus mentioned above. You can attempt to impute personal data entities in these texts during the entire duration of the hackathon. We’ll evaluate your solution by typing the entities you provide with a neural entity classification model, and comparing the model-generated entity types with the types you supply.
We’ll then assign an F1 score to your solution. You can submit your solution file a maximum of 3 times during the hackathon to receive this F1 score; this feedback will help you improve your solution. These submissions are optional, and are evaluated automatically by our Deep Learning model for entity typing.
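For intuition, a micro-averaged F1 over entity types can be computed as in the sketch below. The labels are made up for illustration; the organisers’ model-based evaluation may differ in detail.

```python
from sklearn.metrics import f1_score

# Hypothetical example: types you submitted vs. types predicted by the
# organisers' entity typing model for the same entities.
submitted = ["CREDIT_CARD_NUMBER", "PERSON_NAME", "PHONE_NUMBER"]
model_out = ["CREDIT_CARD_NUMBER", "PERSON_NAME", "ACCOUNT_NUMBER"]
print(f1_score(model_out, submitted, average="micro"))  # 0.666...
```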
Please refer to https://github.com/hackathoniima/ICADABAI2019 for information on making submissions.
At the time of final submission, you’ll have to provide the output file with all 26,636 texts, imputed entities, and their types, in the JSON format shown above. We’ll manually impute personal data entities and types for about 300 randomly selected texts, evaluate your solution by comparing those 300 manually annotated texts with your outputs, and assign the final F1 score. We’ll share our manual annotations so that you can verify the evaluation yourselves. All submissions will then be ranked in descending order of F1 score.
We highly encourage each of the teams participating in this hackathon to write a short paper describing your solution. This could lead to further collaboration and better results. However, this is optional.
There will be up to three prizes, based on the recommendation of the evaluation committee. Each prize will consist of a cash award and a certificate of merit.
You can post your questions in the 2019 Hackathon IIMA Google group. You can also reach out to us at hackathon.iima@gmail.com.