To get you started, here are some pointers for your solutions. We do, however, encourage you to come up with your own innovative solutions to the problem.
- Manual annotations
- The BRAT annotation tool can be used to crowdsource the problem: human annotators guess the masked entities and, optionally, impute values too. A more feasible approach is to have annotators provide only the entity type of each masked entity, and then use a dictionary to impute a value of that type.
- Rule-based annotations
- A rule-based system that uses dictionaries (of names, places, credit card numbers, etc.) can find patterns in sentences and replace the masked portions. IBM's SystemT (or any other solution, or perhaps just regular expressions) can be used to find such patterns. The following course provides an introduction to SystemT.
- Data Programming
- Snorkel is a system for quickly generating large amounts of noisy training data. After creating a gold set manually, you could use Snorkel to annotate more data.
- Model-based annotations
- A machine learning model can also be used to generate words or numbers to replace the redacted portions of a sentence. This problem can perhaps be solved with Natural Language Generation models, which tend to be sequence-to-sequence models.
- Natural Language Generation
- Sequence-to-sequence models
- https://github.com/hackathoniima/ICADABAI2019
- You can refer to the sample code in our GitHub repository to read the input file and write to an output file. This will help you jump-start your solution.
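
As a rough illustration of the rule-based pointer above, here is a minimal sketch that guesses the entity type of a masked span from the words just before it and then imputes a value of that type from a dictionary. Everything specific in it is an assumption made for the example: the mask token `XXXX`, the entity types, the dictionary entries, and the regular expressions are hypothetical and are not taken from the hackathon data, so adapt them to the actual input format.

```python
import re

# Hypothetical mask token; the real dataset may mark redactions differently.
MASK = "XXXX"

# Tiny illustrative dictionaries keyed by entity type (assumed values).
DICTIONARIES = {
    "NAME": ["Asha Rao"],
    "PLACE": ["Ahmedabad"],
    "CARD_NUMBER": ["4111 1111 1111 1111"],
}

# Each rule pairs a regex over the text to the left of the mask with an
# entity type. Rules are tried in order; the first match wins.
RULES = [
    (re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s*$", re.IGNORECASE), "NAME"),
    (re.compile(r"\b(?:card number|card no\.?|card)\s*(?:is|:)?\s*$",
                re.IGNORECASE), "CARD_NUMBER"),
    (re.compile(r"\b(?:in|at|from|to)\s*$", re.IGNORECASE), "PLACE"),
]


def guess_entity_type(left_context):
    """Return the entity type suggested by the left context, or None."""
    for pattern, entity_type in RULES:
        if pattern.search(left_context):
            return entity_type
    return None


def impute(sentence):
    """Replace each mask with a dictionary value of the guessed type.

    Masks that no rule can classify are left unchanged.
    """
    parts = sentence.split(MASK)
    out = [parts[0]]
    for following_text in parts[1:]:
        entity_type = guess_entity_type("".join(out))
        # Always pick the first dictionary entry to keep the sketch simple.
        value = DICTIONARIES[entity_type][0] if entity_type else MASK
        out.append(value)
        out.append(following_text)
    return "".join(out)


if __name__ == "__main__":
    print(impute("Dr. XXXX lives in XXXX."))
    print(impute("My card number is XXXX"))
```

The same two-step split (first decide the type, then pick a value) also matches the manual-annotation pointer: the `guess_entity_type` step could be replaced by types supplied by human annotators, leaving only the dictionary lookup automated.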