Privacy is important to each one of us. Whether we share information about ourselves with the government or with commercial entities like social media websites, it is in our interest to keep that information protected and used only with our consent.
With the advent of data protection regulations like the GDPR in the European Union and the Personal Data Protection Bill, 2018 in India, there is increasing regulatory backing for our privacy and for the protection of our personal data.
On the other end of the spectrum, protecting the personal data of customers is a huge challenge for companies. Even identifying all the personal data held by a company is non-trivial. Identifying personal data entities in customer data, protecting and anonymizing that data, and serving customer requests about how their data is used are all part of a company’s data analytics and regulatory compliance systems.
At the government level, the problem of protecting sensitive personal information assumes gargantuan proportions. Governments collect information about citizens for a number of reasons, such as welfare, identification, and security. All of this information could be linked to a unique identification number like the Aadhaar number, which further increases the data protection requirements.
A Personal Data Entity (PDE) is any entity that can help identify or profile a person, and hence needs to be protected and used only with the consent of the people concerned. Examples of Personal Data Entities include names, addresses, phone numbers, credit card numbers, and government identifiers such as the Aadhaar number.
Each of these PDEs can be assigned one or more fine-grained types called PDE Types (PDETs).
Because of recent advances in Deep Learning, there have been a few attempts to detect personal data entities in unstructured text using neural models. However, neural models require large amounts of training data to make good predictions, and for privacy reasons we cannot train models on the personal data of real people. Hence any such training dataset has to be artificially generated.
It may not be possible to manually annotate datasets large enough to train neural models. Hence we have to come up with ways to programmatically annotate personal data entities in unstructured text.
A team at IBM Research has described one dataset generation method in the following research paper.
Riddhiman Dasgupta, Balaji Ganesan, Aswin Kannan, Berthold Reinwald, and Arun Kumar. "Fine Grained Classification of Personal Data Entities." arXiv preprint arXiv:1811.09368 (2018). https://arxiv.org/abs/1811.09368
However, there continues to be a need for richer and more diverse datasets that can advance research in identifying personal data entities, which in turn will improve privacy and the protection of the personal details that people share with governments and private companies.
IIM Ahmedabad and IBM are excited to bring this dataset generation and coding challenge to students, budding data scientists, industry experts and fellow academicians. The challenge is to generate datasets with fictional but realistic personal data to advance AI research for identifying personal data entities in documents.
Generating training data for machine learning models requires intuition, an “Aha!” idea that saves a lot of manual effort and leads to much better neural model performance. Hence we’re posing this problem to you! We’re seeking fresh, out-of-the-box ideas to accomplish this task. We’ll, however, provide you with resources and some methods that we’re familiar with, to get you started.
We’ll be providing a corpus of English texts drawn from customer complaints to financial companies. The personal data entities in these texts have already been removed and replaced with placeholders like xxxx. As part of this hackathon, you have to impute (create new) values for the personal data entities that have been redacted from the texts.
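As a starting point, the redacted spans can be located with a regular expression, and the entity type can often be guessed from the surrounding words. The minimal Python sketch below assumes the placeholders are runs of the letter “x”; the keyword table and type labels are purely illustrative, not an official PDE type set.

```python
import re

# Assumption: redacted spans appear as runs of the letter 'x' (e.g. "xxxx"
# or "xxxx xxxx"), possibly separated by spaces, slashes, or dashes.
REDACTION = re.compile(r"\b[xX]{2,}(?:[\s/-]+[xX]{2,})*\b")

# Hypothetical keyword-to-type hints; the labels are illustrative only.
TYPE_HINTS = {
    "credit card": "CREDIT_CARD_NUMBER",
    "phone": "PHONE_NUMBER",
    "account": "ACCOUNT_NUMBER",
    "name is": "PERSON_NAME",
}

def guess_type(text, match, window=40):
    """Guess a PDE type from the words immediately preceding a redacted span."""
    context = text[max(0, match.start() - window):match.start()].lower()
    for keyword, pde_type in TYPE_HINTS.items():
        if keyword in context:
            return pde_type
    return "UNKNOWN"

text = "My credit card number is xxxx and I wish to raise a complaint."
for m in REDACTION.finditer(text):
    print(repr(m.group()), "->", guess_type(text, m))
# 'xxxx' -> CREDIT_CARD_NUMBER
```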
While the intellectual curiosity to solve a problem is likely to be your main motivation for participating in this hackathon, we hope this exercise will also be of use to the academic community, government, and industry in India. The datasets produced during this hackathon can be made available to researchers, to build better models that improve privacy and regulatory compliance. Depending on the number of submissions, we may be able to produce a combined large dataset with more diversity of personal data entities than any single team could achieve on its own.
Since dataset generation hackathons are relatively new, we ask for your patience if you encounter bugs or issues with the datasets or tools. Please report any issues to hackathon.iima@gmail.com and we will try to address them.
A method or a system to automatically impute values for the redacted portions in a text, which are known to have contained Personal Data Entities (PDEs).
“My credit card number is xxxx and I wish to raise a complaint ….”
In the above text, the entity masked with xxxx is the redacted portion. We might be able to guess that a 16-digit credit card number was originally present in this text.
The simplest output we are looking for is a rewritten text. In this example, the redacted portion should be replaced with some variant of a 16-digit number.
“My credit card number is 1234-5678-9012-3456 and I wish to raise a complaint ….”
However, a better output would be a credit card number that is not completely random, but obeys the Luhn (mod-10) checksum algorithm, as real card numbers do.
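For illustration, here is a minimal Python sketch that generates a random 16-digit number whose final check digit is chosen so that the whole number passes the Luhn check; the function names are our own.

```python
import random

def luhn_checksum(digits):
    """Luhn (mod-10) checksum of a list of digits."""
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9       # same as summing the two digits of the product
        total += d
    return total % 10

def generate_card_number(length=16):
    """Generate a random number of the given length that passes the Luhn check."""
    payload = [random.randint(0, 9) for _ in range(length - 1)]
    # Pick the check digit that makes the full number's checksum zero.
    check_digit = (10 - luhn_checksum(payload + [0])) % 10
    return "".join(map(str, payload + [check_digit]))

number = generate_card_number()
print("-".join(number[i:i + 4] for i in range(0, 16, 4)))
# e.g. 4539-1488-0343-6467 (Luhn-valid)
```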
For this hackathon, we’ll use popular datasets provided by Dataquest, namely the Credit Card Complaints dataset. We’ve created a smaller version of it by combining the unstructured text columns in the two datasets made available by Dataquest. The dataset is available at https://github.com/hackathoniima/ICADABAI2019.
You should submit a single JSON file containing an array of JSON elements; one possible element format is sketched below.
You can refer to the sample code in our GitHub repository (https://github.com/hackathoniima/ICADABAI2019) to read the input file and write to an output file. This will help you jump-start your solution.
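For illustration only, here is a minimal sketch of such a read/impute/write loop. The file names and field names (“id”, “text”, “entities”, “types”) are assumptions on our part; please follow the exact schema used in the repository’s sample code.

```python
import json

def impute(text):
    """Stand-in for your imputation method: returns the rewritten text plus
    the imputed entity values and their PDE types."""
    value = "4539-1488-0343-6467"          # e.g. a Luhn-valid card number
    return text.replace("xxxx", value, 1), [value], ["CREDIT_CARD_NUMBER"]

# Assumption: the input is a JSON array of records with a "text" field.
with open("input.json", encoding="utf-8") as f:
    records = json.load(f)

output = []
for i, record in enumerate(records):
    new_text, entities, types = impute(record["text"])
    output.append({"id": i, "text": new_text,
                   "entities": entities, "types": types})

# Write the submission file as a single JSON array.
with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(output, f, indent=2)
```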
Notes:
Training and Validation phase:
To encourage maximum participation, we're providing the full corpus of 26,636 English texts at the outset; there will be no separate test corpus. These texts are selected from the Dataquest corpus mentioned above. You can attempt to impute personal data entities in these texts during the entire duration of the hackathon. We’ll evaluate your solution by typing the entities you provide with a neural entity classification model, and comparing the model-generated entity types with the types you supply.
We’ll then assign an F1 score to your solution. You can submit your solution file a maximum of 3 times during the hackathon to receive this F1 score; this feedback will help you improve your solution. These submissions are optional, and are evaluated automatically by our Deep Learning model for entity typing.
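For intuition, a micro-averaged F1 over entity types can be computed as in the sketch below. The labels are made up for illustration; the organisers’ model-based evaluation may differ in detail.

```python
from sklearn.metrics import f1_score

# Hypothetical example: types you submitted vs. types predicted by the
# organisers' entity typing model for the same entities.
submitted = ["CREDIT_CARD_NUMBER", "PERSON_NAME", "PHONE_NUMBER"]
model_out = ["CREDIT_CARD_NUMBER", "PERSON_NAME", "ACCOUNT_NUMBER"]
print(f1_score(model_out, submitted, average="micro"))  # 0.666...
```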
Please refer to https://github.com/hackathoniima/ICADABAI2019 for information on making submissions.
At the time of final submission, you’ll have to provide the output file with all 26,636 texts, imputed entities, and their types, in the JSON format shown above. We’ll manually impute personal data entities and types for about 300 randomly selected texts, evaluate your solution by comparing those 300 manually annotated texts with your outputs, and assign the final F1 score. We’ll share our manual annotations so that you can verify the evaluation yourselves. All submissions will then be ranked in descending order of F1 score.
We highly encourage each of the teams participating in this hackathon to write a short paper describing your solution. This could lead to further collaboration and better results. However, this is optional.
There will be up to three prizes, based on the recommendation of the evaluation committee. Each prize will consist of a cash award and a certificate of merit.
You can post your questions in the 2019 Hackathon IIMA Google group. You can also reach out to us at hackathon.iima@gmail.com.