Nowadays, there is a vast and rapidly increasing quantity of government, scientific, corporate, and crowd-sourced data published on the Web. The emerging Web of data realizes the vision of a data-driven decision-making and is expected to play a catalyst role in the way structured information is exploited across-domains in a large scale. It offers a great potential for building innovative data products and services that create new value from already collected data by fostering active citizenship (awareness regarding  greenhouse gas emissions, food supply-chains, etc.), urban sustainability (smart energy consumption, mobility, etc.), but also world-wide research according to the fourth paradigm of science.

In particular, according to the Linked Data paradigm, real world entities are described on the Web by interlinked data rather than documents. Data publishers are thus encouraged to describe entities using W3C standards (RDF, etc.) by naming them with HTTP URIs, so that people can access their descriptions on the Web, as well as by including links to other URIs, so that people can discover more entities (persons, places, artefacts, etc.). Due to the open and decentralized nature of the Web, real world entities are usually described in multiple datasets using different URIs in a partial, overlapping and sometimes evolving way. Recognizing descriptions of the same real-world entity across (and sometimes within) data sources becomes a central problem in the Web of data in order to significantly increase dataset linking. Typical examples include linking of customer or patient records, bibliographic citations, points of interest, events, cultural artefacts, etc.

Data describing entities could be made available in the Web under different formats (e.g., tabular, tree or graph) of varying structuredness and licensing. Entity resolution techniques proposed for relational data (merging customer databases, library catalogues, etc.) are not suited for the Web of data due to the high heterogeneity and non-regularity in data structuring. It is worth noticing that ontologies actually employed by linked data represent in their vast majority simple vocabularies (of classes or properties) rather than schemas imposing extensive structural and semantic data constraints.

Some of the challenges arising in this context are:
  1. How can we compare two entity descriptions when they do not adhere to the same structure?
  2. How can entities be resolved when their descriptions exhibit complex structures?
  3. How can entity resolution comparisons benefit from scalable computing platforms (Map/Reduce, etc.)?

The solution space of entity resolution

Entity resolution has been studied in a variety of contexts, using several approaches. In this tutorial, we introduce three dimensions for the solution space of entity resolution, namely type of method input, objective of method and type of method. Concerning the type of input data, we consider the cases of tabular, tree and graph data. We study entity resolution approaches with respect to their particular objectives, i.e. effectiveness, efficiency and scalability, and their methods, by distinguishing them between blocking, iterative and learning ones.

Type of method input: Different types of input data impose different solutions for the problem of entity resolution. In particular, due to the high structuredness of tabular data, for computing similarities between entity descriptions of this type, it suffices to compare the values of their common attributes. For the case of tree data, e.g. XML, where a hierarchical structure exists, i.e. ancestor and descendant relations, the similarity of values, used to compare two entity descriptions, is affected by the similarity of their ancestors and descendants. Data can be also graph structured, as in RDF, featuring cycles and non-unique root elements. Hence, the problem of computing similarities becomes harder.

Objective of method: We discern entity resolution methods with respect to their main objectives between those that aim at effectiveness, efficiency, and scalability. Algorithms focusing on effectiveness, try to find as many true matches and as few false matches as possible, while algorithms focusing on efficiency aim at resolving the given entity descriptions as fast as possible, usually by reducing the huge number of redundant comparisons. Scalable entity resolution methods are those that can cope with Big Data, typically, by parallelizing the task of entity resolution and distributing it to multiple computational resources, such as Map/Reduce nodes.

Type of method: Methods that perform entity resolution are categorized as blocking, iterative, or learning. Blocking-based approaches group together, in the same block, entity descriptions that are close to each other, denoting possible matches. To accomplish that, they typically rely on blocking keys, i.e. criteria based on which the descriptions are placed into blocks. Iterative algorithms are based on the idea that identifying some matching entity descriptions can lead to new matches, e.g. by using the transitive closure of the matched pairs, or by using the already merged descriptions. Learning-based approaches typically use an initial set of training data, annotated as matches and non-matches, and then classify the entity descriptions in these two categories, using statistical inference.

The tutorial concludes with a discussion on open issues related to entity resolution in the Web of data.