One of our main research topics is data integration. Data integration is the field of combining different sources of data to enrich one from the other or to increase our knowledge of the world.
Software tools that support semi-automated data integration can be divided into two broad groups:
1) Those designed to support the integration of data descriptions, also known as schemas and ontologies.
2) Those designed to handle duplicate data, a task also known as data deduplication, record linkage, entity resolution, and entity consolidation.
Below we give a brief overview of the two main topics within data integration: schema matching and entity resolution. If you are so inclined, you can watch a series of videos on the subject prepared by Laura Haas, Mary Roth, and Lucian Popa from the University of Massachusetts Amherst and IBM Research Almaden.
Schema matching is the data integration process that examines the semantic relations between two schemas and produces a mapping between them. The problem arises when matching schemas from different systems in the same domain. For example, when upgrading a CRM system, the schema of the old system needs to be mapped to that of the new one to facilitate data transfer between them. This process used to be performed manually, but researchers have more recently succeeded in partially automating schema matching using machine learning.
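The CRM example can be illustrated with a minimal sketch that maps attributes by name similarity alone. The attribute names, the `match_schemas` helper, and the 0.6 threshold are assumptions made for illustration; real matchers combine many more signals than a single string score.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase an attribute name and drop separators such as '_'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two attribute names (difflib ratio)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_schemas(source, target, threshold=0.6):
    """Map each source attribute to its most similar target attribute.

    Attributes whose best score falls below `threshold` stay unmapped,
    which is where a human reviewer would step in.
    """
    mapping = {}
    for s in source:
        best = max(target, key=lambda t: name_similarity(s, t))
        if name_similarity(s, best) >= threshold:
            mapping[s] = best
    return mapping


# Hypothetical attribute names for an old and a new CRM system.
old_crm = ["cust_name", "phone_no", "addr", "fax"]
new_crm = ["CustomerName", "PhoneNumber", "Address", "Email"]

mapping = match_schemas(old_crm, new_crm)
for source_attr, target_attr in mapping.items():
    print(f"{source_attr} -> {target_attr}")
```

With these made-up names, the three look-alike attributes are paired and `fax` is left unmapped, which shows why the process is semi-automated: low-confidence attributes still need a person to decide.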
The lab maintains the Ontobuilder research environment together with Dr. Avi Gal of Technion.
Here are a few more popular matching systems:
Cupid [2].
COMA [3].
NOM (Naive Ontology Mapping).
QOM (Quick Ontology Mapping).
OLA (OWL Lite Aligner).
S-Match - Semantic relation between two graphs.
Artemis.
Matching techniques fall into two groups:
1. Element techniques:
String matching.
NLP techniques.
Alignment reuse (external resources).
Upper-level formal ontologies (external sources of common knowledge).
Constraints (such as types, cardinality of attributes, and keys).
2. Structure techniques:
Graphs (graph matching, children, leaves, relations).
Taxonomy - graph algorithms that consider only the specialization relation.
Repository of structures - similarities between schemas/ontologies.
Model-based algorithms - semantic interpretation.
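To make the two groups concrete, here is a small sketch that combines an element technique (string matching gated by a type constraint) with a structure technique (comparing two elements through their leaves). The schemas, function names, and the averaging scheme are assumptions for illustration and do not reproduce any of the systems listed above.

```python
from difflib import SequenceMatcher

# Hypothetical schemas: each element maps to {leaf attribute: data type}.
schema_a = {
    "Customer": {"name": "string", "birth_date": "date", "phone": "string"},
    "Order": {"order_id": "int", "total": "decimal", "order_date": "date"},
}
schema_b = {
    "Client": {"full_name": "string", "date_of_birth": "date", "phone": "string"},
    "Purchase": {"id": "int", "total_amount": "decimal", "purchase_date": "date"},
}


def leaf_similarity(name_a, type_a, name_b, type_b):
    """Element level: string similarity of names, gated by a type constraint."""
    if type_a != type_b:  # constraint filter: incompatible types never match
        return 0.0
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def structure_similarity(leaves_a, leaves_b):
    """Structure level: average best leaf match, taken in both directions."""
    def one_way(src, dst):
        return sum(
            max(leaf_similarity(n, t, m, u) for m, u in dst.items())
            for n, t in src.items()
        ) / len(src)

    return (one_way(leaves_a, leaves_b) + one_way(leaves_b, leaves_a)) / 2


def match_elements(a, b):
    """Pair each element of `a` with the element of `b` whose leaves look most alike."""
    return {
        ea: max(b, key=lambda eb: structure_similarity(a[ea], b[eb]))
        for ea in a
    }


pairs = match_elements(schema_a, schema_b)
print(pairs)
```

Note that `Customer` and `Client` share little as strings; they are paired because their leaves match well, which is the point of adding structure-level evidence on top of element-level scores.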
[1] Sagi, Tomer, and Avigdor Gal. "Schema matching prediction with applications to data source discovery and dynamic ensembling."
[2] Madhavan, Jayant, Philip A. Bernstein, and Erhard Rahm. "Generic schema matching with Cupid." Proceedings of the Very Large Data Bases Conference (VLDB) (2001): 49–58.
[3] Do, H. H., and E. Rahm. "COMA - a system for flexible combination of schema matching approaches." Proceedings of the Very Large Data Bases Conference (VLDB) (2002): 610–621.
[4] Bernstein, Philip A., Jayant Madhavan, and Erhard Rahm. "Generic schema matching, ten years later." Proceedings of the VLDB Endowment 4.11 (2011): 695-701.
Entity resolution (ER) is the task of linking and grouping different manifestations of the same real-world object. The purpose of ER is to identify the records that represent the same entity and reconcile them to obtain one record per entity.
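The three steps named above (linking, grouping, reconciling) can be sketched on a toy dataset. The records, the 0.85 name-similarity threshold, and the "first non-empty value wins" merge rule are all assumptions chosen for illustration, not a recommended configuration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records: three manifestations of one person and one distinct person.
records = [
    {"id": 1, "name": "Jonathan Smith", "city": "Haifa", "phone": None},
    {"id": 2, "name": "Jonathon Smith", "city": "Haifa", "phone": "04-555-0101"},
    {"id": 3, "name": "Jonathan Smyth", "city": "haifa", "phone": None},
    {"id": 4, "name": "Maria Garcia", "city": "Madrid", "phone": "91-555-0199"},
]


def same_entity(a, b, threshold=0.85):
    """Linking: same city (case-insensitive) and sufficiently similar names."""
    if a["city"].lower() != b["city"].lower():
        return False
    score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return score >= threshold


def resolve(records):
    """Link matching pairs, group them with union-find, and merge each group."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if same_entity(a, b):
            parent[find(a["id"])] = find(b["id"])

    # Grouping: collect records under their union-find root.
    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), []).append(r)

    # Reconciling (assumed rule): keep the first non-empty value per field.
    merged = []
    for members in groups.values():
        entity = {"source_ids": [m["id"] for m in members]}
        for field in ("name", "city", "phone"):
            entity[field] = next((m[field] for m in members if m[field]), None)
        merged.append(entity)
    return merged


entities = resolve(records)
for entity in entities:
    print(entity)
```

The merged record for the first group picks up the phone number that only one of its three source records carried, which is the enrichment that reconciliation is meant to deliver.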
Ebraheem, Muhammad, et al. "DeepER--Deep Entity Resolution." arXiv preprint arXiv:1710.00597 (2017).
The fundamental contribution is the identification of distributed representations as a key building block for designing effective ER classifiers.
Sagi, Tomer, et al. "Multi-source uncertain entity resolution at Yad Vashem: Transforming Holocaust victim reports into people." Proceedings of the 2016 International Conference on Management of Data. ACM, 2016.
This paper represents a unique opportunity to apply state-of-the-art research prototypes to a real-life dataset.
Michelson, Matthew, and Craig A. Knoblock. "Learning blocking schemes for record linkage." AAAI. 2006.
Shows a technique for learning blocking schemes that improves on the hand-generated blocking schemes produced by non-domain experts.
Wang, Qing, Mingyuan Cui, and Huizhi Liang. "Semantic-aware blocking for entity resolution." Data Engineering (ICDE), 2016 IEEE 32nd International Conference on. IEEE, 2016.
Presents a framework that takes into account both textual and semantic similarities in the ER blocking process.
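Blocking, the subject of the last two papers, limits pairwise comparisons to records that share a blocking key. The sketch below uses one textual key and one semantic key on made-up publication records. The key functions and the `CONCEPT_OF_VENUE` lookup are hypothetical, and the code reproduces neither paper's method; it only shows what a blocking scheme does.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical publication records.
records = [
    {"id": 1, "title": "Deep entity resolution", "venue": "VLDB"},
    {"id": 2, "title": "Deep entity resolutoin", "venue": "PVLDB"},
    {"id": 3, "title": "Learning blocking schemes", "venue": "AAAI"},
    {"id": 4, "title": "Learning blocking schemes for record linkage",
     "venue": "AAAI Conference"},
    {"id": 5, "title": "Graph matching survey", "venue": "ICDE"},
]

# Assumed semantic lookup: venue string -> broader concept.
CONCEPT_OF_VENUE = {
    "VLDB": "databases", "PVLDB": "databases", "ICDE": "databases",
    "AAAI": "ai", "AAAI Conference": "ai",
}


def textual_key(record):
    """Textual blocking key: first four characters of the title, lowercased."""
    return record["title"][:4].lower()


def semantic_key(record):
    """Semantic blocking key: the concept the venue belongs to."""
    return CONCEPT_OF_VENUE.get(record["venue"], "unknown")


def candidate_pairs(records, key_functions):
    """Return id pairs of records that agree on every blocking key."""
    blocks = defaultdict(list)
    for r in records:
        blocks[tuple(f(r) for f in key_functions)].append(r["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


all_pairs = set(combinations([r["id"] for r in records], 2))
blocked = candidate_pairs(records, [textual_key, semantic_key])
print(f"{len(blocked)} of {len(all_pairs)} pairs survive blocking: {sorted(blocked)}")
```

Only the surviving pairs would be passed to a more expensive matcher. Hand-picking keys like these is what learned blocking schemes aim to replace.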