Towards Hierarchical Affiliation Resolution
Framework, Baselines, Dataset
Alternative Visualizations / JSON
Within the framework, our general approach is quite versatile.
In the paper, we presented a conservative baseline for conflation with three convenient properties: it is order-invariant, it does not require observation counts, and it practically never merges too much. Because of this conservatism, however, it leaves unmerged many nodes that actually belong to the same organizational branch.
Below we visualize an alternative conflation process that uses edge weights computed dynamically from observation counts (including discounting observation mass to lower branches).
The header of each node shows four numbers: obs (obs') | car (car')
obs: original observation count (number of mentions represented by this node)
car: original carry count (number of mentions represented by this node or by one of its specifications)
obs': observation count at current state of discounting
car': carry count at current state of discounting
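The relation between obs and car can be illustrated with a minimal sketch. It assumes that the specifications of a node are all nodes reachable from it in the DAG and that each of them is counted once; the toy graph, the counts and the helper carry_counts are hypothetical and not taken from our code. The discounted values obs' and car' depend on the current conflation state and are not modelled here.

```python
from typing import Dict, List, Set


def carry_counts(children: Dict[str, List[str]], obs: Dict[str, int]) -> Dict[str, int]:
    """Return car for every node: its own obs plus the obs of all distinct descendants."""

    def descendants(node: str, seen: Set[str]) -> Set[str]:
        # Depth-first traversal; `seen` ensures shared nodes are counted once.
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                descendants(child, seen)
        return seen

    return {
        node: obs[node] + sum(obs[d] for d in descendants(node, set()))
        for node in obs
    }


# Hypothetical toy DAG: "lab" is a specification of both "dept_a" and "dept_b".
children = {"univ": ["dept_a", "dept_b"], "dept_a": ["lab"], "dept_b": ["lab"]}
obs = {"univ": 10, "dept_a": 4, "dept_b": 3, "lab": 2}
car = carry_counts(children, obs)
```

In this toy example the shared node "lab" contributes its two mentions to both departments, but only once to "univ".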
The visualizations are not meant to introduce or explain the above-mentioned conflation method, but simply to underline the versatility and potential of our framework.
Accordingly, we also provide a number of JSON output files that include, for each representation, the original affiliation strings and their frequencies.
Because JSON is tree-structured, we convert each graph into a redundant tree in which a node is duplicated once for each of its incoming edges in the DAG.
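The conversion from a DAG to a redundant tree can be sketched as follows. This is only an illustration under assumptions: the adjacency mapping, the field names "name" and "children", and the function dag_to_tree are hypothetical and do not describe the schema of our JSON files.

```python
import json
from typing import Any, Dict, List


def dag_to_tree(node: str, children: Dict[str, List[str]]) -> Dict[str, Any]:
    """Expand the DAG below `node` into a nested dict.

    A node with several incoming edges is reached once per edge and is
    therefore copied into the tree once per edge.
    """
    return {
        "name": node,
        "children": [dag_to_tree(child, children) for child in children.get(node, [])],
    }


# Hypothetical toy DAG: "lab" has two incoming edges.
children = {"univ": ["dept_a", "dept_b"], "dept_a": ["lab"], "dept_b": ["lab"]}
tree = dag_to_tree("univ", children)
tree_json = json.dumps(tree, indent=2)
```

Here "lab" appears twice in the serialized tree, once under each department. For graphs with many shared nodes, the tree can become considerably larger than the DAG.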
Please contact the authors if you are interested in more details or want to use the complete output.
German Institutions
International Institutions
Dataset
We have created a small hierarchical evaluation dataset that links the top-level annotation from "the KB disambiguation system" with the GERiT hierarchies. We have also implemented the means for easily extending this dataset by verifying or falsifying automatically suggested links.
The evaluation dataset has the following structure, where each row corresponds to one WoS mention linked to a GERiT node.
dfgid: the node identifier in GERiT
dfg_de: the German name of the node
dfg_en: the English name of the node
mentionID: the internal identifier of the WoS affiliation automatically linked to that node
PK_KB_INST: the top-level identifier from "the KB disambiguation system"
score: a certainty score from the automatic linking process
ADDRESS_FULL: the full WoS affiliation string linked to the node
ref_string: the above string with the street address removed where possible
verified1: Boolean value indicating whether the automatic link was manually verified by a first annotator
verified2: Boolean value indicating whether the automatic link was manually verified by a second annotator
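A minimal sketch of how rows with this structure could be read and restricted to links confirmed by both annotators is given below. The file format is an assumption: the sketch supposes a tab-separated file with a header row and Booleans written as "True"/"False". The sample rows are invented for illustration and do not come from the dataset.

```python
import csv
import io
from typing import Dict, List

# Hypothetical sample in the assumed tab-separated layout; all values are invented.
SAMPLE = (
    "dfgid\tdfg_de\tdfg_en\tmentionID\tPK_KB_INST\tscore\t"
    "ADDRESS_FULL\tref_string\tverified1\tverified2\n"
    "101\tInstitut A\tInstitute A\tm1\t7\t0.92\t"
    "Inst A, Example Str 1, City\tInst A, City\tTrue\tTrue\n"
    "102\tInstitut B\tInstitute B\tm2\t7\t0.55\t"
    "Inst B, City\tInst B, City\tTrue\tFalse\n"
)


def load_rows(handle) -> List[Dict[str, str]]:
    """Read all rows into dicts keyed by the column names of the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def doubly_verified(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only links that both annotators marked as verified."""
    return [r for r in rows if r["verified1"] == "True" and r["verified2"] == "True"]


rows = load_rows(io.StringIO(SAMPLE))
gold = doubly_verified(rows)
```

Replacing io.StringIO(SAMPLE) with an opened file handle would apply the same filter to an actual export, provided it matches the assumed layout.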
Raw Code
In this repository, you can find the current version of our code for the presented framework and methods. It has not yet been cleaned up and commented for third-party use. Please contact the authors if you are interested in using the code.
Note: The code used for finding minimal elements and connected components during the separation step is currently not included.