Workshop on Multilingual de-Identification of (sensitive) LRs

Important information

The "Multilingual de-identification of (sensitive) LRs" workshop has been merged with the LEGAL2022 workshop ("Legal and Ethical Issues in Human Language Technologies"). Please visit their website for details on the program.


The General Data Protection Regulation (GDPR, Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016) governs the protection of natural persons with regard to the processing of personal data and the free movement of such data. It sets out specific rules that protect citizens and user data and bring transparency to information sharing. The GDPR is widely regarded as one of the strictest data privacy regulations in the world, and considerable work is under way to develop techniques and deploy systems that help comply with it while keeping data accessible and thus usable for further processing.

Different techniques are being studied to guarantee such compliance. They imply different levels of protection for sensitive content and offer a short- or long-term guarantee, depending on whether additional related information may be accessible. In this regard, the literature covers anonymization, de-identification and pseudonymization. While anonymization implies a zero re-identification risk, which is extremely difficult to secure, de-identification and pseudonymization represent an attainable target under the GDPR, which defines pseudonymization (Article 4(5)) as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.”

Bearing this context in mind, multilingual approaches and toolkits for the de-identification of (sensitive) language resources may provide the means to share language data while protecting private or sensitive data, by spotting and then deleting, obfuscating, pseudonymizing or encrypting personally identifying information.
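
As a rough illustration of the "spot, then pseudonymize" idea described above, the following Python sketch replaces detected identifiers with placeholders and returns the lookup table separately, mirroring the GDPR requirement that the additional information be kept apart from the data. Everything in it is an assumption made for illustration: the function names `pseudonymize` and `reidentify`, the placeholder format, and the two regular expressions, which only catch e-mail addresses and phone-like digit runs. Personal names and other identifiers are not detected here; real multilingual systems would typically rely on language-specific named entity recognition.

```python
import re

# Illustrative patterns only (assumption): they spot e-mail addresses and
# phone-like digit sequences, not names or other identifiers.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def pseudonymize(text, mapping=None):
    """Replace spotted identifiers with stable placeholders.

    Returns the rewritten text and the mapping (placeholder -> original).
    The mapping is the "additional information" and should be stored
    separately from the text, under appropriate safeguards.
    """
    mapping = {} if mapping is None else mapping
    reverse = {original: placeholder for placeholder, original in mapping.items()}

    def make_replacer(label):
        def replace(match):
            original = match.group(0)
            if original not in reverse:
                # Number placeholders per label: [EMAIL_1], [EMAIL_2], ...
                index = sum(1 for p in mapping if p.startswith(f"[{label}_")) + 1
                placeholder = f"[{label}_{index}]"
                mapping[placeholder] = original
                reverse[original] = placeholder
            return reverse[original]
        return replace

    for label, pattern in PATTERNS.items():
        text = pattern.sub(make_replacer(label), text)
    return text, mapping


def reidentify(text, mapping):
    """Restore the original identifiers; only possible with the mapping."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

Passing the same mapping to later calls keeps placeholders consistent across documents, so the same e-mail address always maps to the same placeholder. Deleting the mapping afterwards moves the result closer to anonymization, although, as noted above, a zero re-identification risk is hard to guarantee because other clues may remain in the text.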