This project has two complementary sets of goals: system development goals and academic goals.
System Development Goals
We develop py_stringmatching, a software package of tokenizers and string similarity measures, for the Python data science ecosystem. Many data science tasks need such tools, but the ecosystem currently lacks a good package that provides them.
We develop this package so that we can use it as a building block for several other Python software packages, e.g., py_stringsimjoin and py_entitymatching. These three packages, together with several others, make up the Magellan entity matching system.
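To make the terms "tokenizer" and "string similarity measure" concrete, here is a minimal plain-Python sketch. It is an illustration only and does not use or reproduce the py_stringmatching API: the function names `whitespace_tokenize` and `jaccard` are hypothetical, and the choice of whitespace tokenization with Jaccard similarity is an assumption made for the example.

```python
def whitespace_tokenize(text):
    """Split a string into tokens on runs of whitespace (hypothetical helper)."""
    return text.split()


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token lists: |A & B| / |A | B| over token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        # Assumption for this sketch: two empty inputs count as identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


left = whitespace_tokenize("data science tools")
right = whitespace_tokenize("data management tools")
score = jaccard(left, right)  # 2 shared tokens out of 4 distinct tokens
```

A tokenizer turns each string into a collection of tokens, and a similarity measure then scores two such collections. Packages that join or match records can build on these two pieces.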
Academic Goals
We use this project to study how to develop open-source software in the form of Python packages in an academic environment. We are especially interested in the entire development process. Going forward, we believe that database researchers will want to build many data management systems in the form of interoperable software packages for data science ecosystems. Thus, it is important to understand this software development process.
We use this project to study how to design and combine multiple software packages within a data science ecosystem. Building large, complex, monolithic data analytics systems is often very difficult. A promising alternative is to develop multiple software packages (each solving a core problem) within a data management ecosystem, then combine these packages to solve data analytics tasks. For this alternative approach to work, we need to understand how to design and combine packages within an ecosystem.