Title: Splink: An open source package for record linkage at scale using Apache Spark
The UK’s Ministry of Justice (MoJ) has begun an ambitious programme of work called Data First, which aims to improve the quality of the department’s data to enable better research.
As part of this work, the team needed to develop an approach to data linking and deduplication that would work at the scale of around 100 million records, more than most existing open source record linkage packages are able to handle. The work also needed to demonstrate very high standards of transparency and robustness to make it suitable for government use, and to ensure results could be easily explained to linked data users.
The result is a new data linkage package called Splink, written in PySpark. The package implements the Fellegi-Sunter model of record linkage, with parameters estimated using the Expectation Maximisation algorithm. It introduces a number of innovations that make record linkage and deduplication faster and more flexible, including a variety of graphical outputs and heavily customisable configuration options. This talk will introduce the software, concentrating on its innovative features and how it has been used to improve the quality and speed of record linkage at the MoJ.
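To make the modelling approach concrete, the following is a minimal sketch of the Fellegi-Sunter model with parameters estimated by Expectation Maximisation, written in plain Python rather than PySpark. It is not Splink's code or API: the function names (clamp, match_probability, estimate_parameters), the starting values and the toy data are all hypothetical, and it assumes each column comparison is a simple agree/disagree flag and that comparisons are conditionally independent given match status.

```python
def clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid division by zero."""
    return min(max(x, eps), 1.0 - eps)


def match_probability(gamma, lam, m, u):
    """Posterior probability that a record pair is a match.

    gamma: tuple of 0/1 agreement flags, one per compared column
    lam:   prior probability that a random pair is a match
    m[i]:  P(column i agrees | pair is a match)
    u[i]:  P(column i agrees | pair is not a match)
    """
    p_match, p_non_match = lam, 1.0 - lam
    for g, m_i, u_i in zip(gamma, m, u):
        p_match *= m_i if g else 1.0 - m_i
        p_non_match *= u_i if g else 1.0 - u_i
    return p_match / (p_match + p_non_match)


def estimate_parameters(pairs, n_iter=50):
    """Estimate lam, m and u from unlabelled comparison vectors using EM."""
    n, k = len(pairs), len(pairs[0])
    # Illustrative starting values (an assumption, not taken from the talk).
    lam, m, u = 0.1, [0.9] * k, [0.1] * k
    for _ in range(n_iter):
        # E-step: match probability of every pair under current parameters.
        w = [match_probability(g, lam, m, u) for g in pairs]
        total = sum(w)
        # M-step: re-estimate parameters as probability-weighted averages.
        lam = clamp(total / n)
        m = [clamp(sum(wi * g[i] for wi, g in zip(w, pairs)) / total)
             for i in range(k)]
        u = [clamp(sum((1.0 - wi) * g[i] for wi, g in zip(w, pairs)) / (n - total))
             for i in range(k)]
    return lam, m, u


# Made-up comparison vectors for three columns (for example name, date of
# birth, postcode): a small cluster of mostly-agreeing pairs among many
# mostly-disagreeing ones.
pairs = (
    [(1, 1, 1)] * 18 + [(1, 1, 0)] * 2
    + [(0, 0, 0)] * 70 + [(1, 0, 0)] * 5 + [(0, 1, 0)] * 5
)
lam, m, u = estimate_parameters(pairs)
print("match probability, all columns agree:", match_probability((1, 1, 1), lam, m, u))
print("match probability, no columns agree: ", match_probability((0, 0, 0), lam, m, u))
```

No labelled training data is needed: the algorithm alternates between scoring every pair with the current parameters and re-estimating the parameters from those scores. Because the estimated m and u values can be inspected column by column, each linkage decision can be traced back to the comparisons that drove it.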
Bio:
Robin Linacre is a data scientist leading work on data linking methodology at the Ministry of Justice. He has a background in econometrics but more recently has worked on a variety of open source analytical packages and infrastructure. In his previous role, he worked on the MoJ's new analytical platform, designing the data engineering infrastructure to enable analysts to rapidly perform analysis on big datasets.