Towards Hierarchical Affiliation Resolution

Framework, Baselines, Dataset

Alternative Visualizations / JSON

Within the framework, our general approach is quite versatile.

In the paper, we presented a conservative baseline for conflation that has the convenient properties of being order-invariant, not requiring observation counts, and practically never merging too much. However, we also noted that, because of this conservatism, it fails to merge many nodes that actually belong to the same organizational branch.

Below we visualize an alternative conflation process that uses edge weights computed dynamically from observation counts (including discounting observation mass to lower branches).

The header of each node shows four numbers in the form: obs (obs') | car (car')

obs: original observation count (number of mentions represented by this node)

car: original carry count (number of mentions represented by this node or by a specification)

obs': observation count at current state of discounting

car': carry count at current state of discounting
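The relation between obs and car can be illustrated with a minimal sketch. It assumes that a "specification" is any more specific node reachable from the current one in the DAG and that each node's mentions are counted only once even if the node is reachable via several paths. Both assumptions, as well as the names `carry_counts`, `format_header`, `children` and the toy node identifiers, are hypothetical and not taken from our code. The discounting that produces obs' and car' is not modelled here.

```python
from typing import Dict, List, Set


def carry_counts(children: Dict[str, List[str]], obs: Dict[str, int]) -> Dict[str, int]:
    """Compute car for every node as obs of the node plus obs of all
    distinct descendants (assumed reading of "specification")."""

    def descendants(node: str, seen: Set[str]) -> Set[str]:
        # Depth-first collection of all nodes reachable from `node`.
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                descendants(child, seen)
        return seen

    return {
        node: obs[node] + sum(obs[d] for d in descendants(node, set()))
        for node in obs
    }


def format_header(obs, obs_now, car, car_now) -> str:
    """Render the node header in the form: obs (obs') | car (car')."""
    return f"{obs} ({obs_now}) | {car} ({car_now})"


# Toy DAG (hypothetical): "inst" has two parents, "fac_a" and "fac_b".
children = {"univ": ["fac_a", "fac_b"], "fac_a": ["inst"], "fac_b": ["inst"]}
obs = {"univ": 5, "fac_a": 3, "fac_b": 2, "inst": 4}
car = carry_counts(children, obs)

# Before any discounting, the primed values equal the original ones.
print(format_header(obs["univ"], obs["univ"], car["univ"], car["univ"]))
```

In this toy example, the mentions of "inst" contribute once to the carry count of "univ" although "inst" is reachable through both faculties.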

The visualizations are not intended to introduce or explain the above-mentioned conflation method; they simply underline the versatility and potential of our framework.

Correspondingly, we also present a number of JSON output files including the original affiliation strings and their frequency for each representation.

As JSON is a tree structure, we create a redundant tree from each graph that duplicates nodes for each incoming edge in the DAG.
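The unfolding of a DAG into a redundant tree can be sketched as follows. This is only an illustration under stated assumptions: the function name `unfold`, the adjacency mapping `children`, the `mentions` mapping and the JSON keys `id`, `mentions` and `children` are hypothetical and do not necessarily match the layout of our output files.

```python
import json
from typing import Dict, List


def unfold(node: str,
           children: Dict[str, List[str]],
           mentions: Dict[str, Dict[str, int]]) -> dict:
    """Turn the sub-DAG below `node` into a nested dict.

    A node with several incoming edges is copied once per edge,
    so the result is a tree that can be serialized as JSON.
    """
    return {
        "id": node,
        # Original affiliation strings with their frequency (hypothetical key).
        "mentions": mentions.get(node, {}),
        "children": [unfold(c, children, mentions) for c in children.get(node, [])],
    }


# Toy DAG (hypothetical): "inst" has two incoming edges.
children = {"univ": ["fac_a", "fac_b"], "fac_a": ["inst"], "fac_b": ["inst"]}
mentions = {"inst": {"Example Univ, Example Inst": 4}}

tree = unfold("univ", children, mentions)
print(json.dumps(tree, indent=2))
```

Here "inst" appears twice in the output, once below each faculty. Note that such unfolding can enlarge the output considerably when many nodes have several parents.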

Please contact the authors if you are interested in more details or want to use the complete output.

German Institutions

International Institutions

Dataset

We have created a small hierarchical evaluation dataset that links the top-level annotation from "the KB disambiguation system" with the GERiT hierarchies. We have also implemented the means for easily extending this dataset by verifying or falsifying automatically suggested links.

The evaluation dataset has the following structure, where each row corresponds to one WoS mention linked to a GERiT node.

  • dfgid: the node identifier in GERiT

  • dfg_de: the German name of the node

  • dfg_en: the English name of the node

  • mentionID: the internal identifier of the WoS affiliation automatically linked to that node

  • PK_KB_INST: the top-level identifier from "the KB disambiguation system"

  • score: a certainty score from the automatic linking process

  • ADDRESS_FULL: the full WoS affiliation string linked to the node

  • ref_string: the above string with the street address removed where possible

  • verified1: Boolean value indicating whether the automatic link was manually verified by a first annotator

  • verified2: Boolean value indicating whether the automatic link was manually verified by a second annotator
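As an illustration of how these columns might be used, the following sketch selects the rows whose link was verified by both annotators. It assumes that the dataset is available as a CSV file with the column names listed above and that Boolean values are stored as text such as "True" or "False"; the file format, the encoding of the Booleans, the helper names and the sample rows are all hypothetical placeholders, not real data.

```python
import csv
import io
from typing import Dict, Iterable, List

# Assumed textual encodings of a true Boolean value.
TRUE_VALUES = {"true", "1", "yes"}


def as_bool(value: str) -> bool:
    """Interpret a text cell as a Boolean (assumed encoding)."""
    return value.strip().lower() in TRUE_VALUES


def doubly_verified(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only rows verified by both the first and the second annotator."""
    return [r for r in rows if as_bool(r["verified1"]) and as_bool(r["verified2"])]


# Placeholder rows for illustration only; all values are made up.
SAMPLE = """dfgid,dfg_de,dfg_en,mentionID,PK_KB_INST,score,ADDRESS_FULL,ref_string,verified1,verified2
n1,Beispielinstitut,Example Institute,m1,k1,0.9,"Example Univ, Example Inst, Example St 1","Example Univ, Example Inst",True,True
n2,Beispielabteilung,Example Department,m2,k1,0.4,"Example Univ, Example Dept","Example Univ, Example Dept",True,False
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
gold = doubly_verified(rows)
print([r["mentionID"] for r in gold])
```

Requiring agreement of both annotators is only one possible choice; a less strict evaluation could accept rows verified by either annotator.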

Raw Code

In this repository, you can find the current version of our code for the presented framework and methods. It has not yet been cleaned up and commented for third-party use. Please contact the authors if you are interested in using the code.

Note: The code used for finding minimal elements and connected components during the separation step is currently not included.