A Round-Trip to the Annotation Store:
Open, Transferable Semantic Annotation
of Biomedical Publications
Tim Clark*, Paolo Ciccarese*, Terri Attwood†,
Anita de Waard** and Steve Pettifer†
* Massachusetts General Hospital and Harvard Medical School, Boston MA, USA;
†School of Computer Science, University of Manchester, Manchester, UK;
Laboratories, Burlington VT, USA
December 8, 2010
Ontology-driven annotation of biomedical literature is an active research area of potentially great practical importance for collaboration and information sharing.
With progress in biomedical ontologies, textmining algorithms and other software, it is now possible to integrate a critical mass of ontology-aligned metadata referencing the biomedical literature. A diverse set of participants including academic informaticians, scientific publishers, information vendors, and bibliographic repositories have made efforts in this direction. We believe it is now time to align them on a common infrastructure to provide a network effect.
For the Beyond the PDF workshop, we plan to develop an interoperability demonstration showing how open RDF metadata can be freely exchanged between the SWAN Annotation Framework and the Utopia PDF annotation system. The RDF metadata will be exchanged in Annotation Ontology (AO) format.
We believe this demonstration can help point out the enormous potential of creating common specifications and exchange points for such metadata, independent of the documents referenced.
Ontology-driven annotation of biomedical literature is an active research area of potentially great practical importance for collaboration and information sharing. We believe one of the keys to progress is the creation of an “information ecosystem” around open, transferable annotation objects referencing common elements of scientific publications, experimental data, computational workflows and biomedical databases.
That is, annotation metadata produced against common representations of a scientific publication should be fully interchangeable, just as documents themselves are. It should constitute an open, transferable artifact of scientific work.
Biomedical documents and images, including scientific papers, clinical notes, research databases and both research and clinical images, contain deep specialist information which can be extracted by (a) ontology-driven text mining as well as by (b) manual annotation. Furthermore, even before a document is published, various computational tools and processes can produce (c) workflows, computations and experimental data associated with the results and interpretations contained in a document.
Area (a) is a very active research area with many techniques fully reduced to practice, with evidence of a great deal of interest by scientific publishers and the textmining community. Ontology-driven text mining can be and should be assisted by human specialist reviews and curation of the results for maximum specificity and to train search algorithms through feedback.
Area (b) is implemented in many biomedical databases using formal ontologies such as the Gene Ontology [1, 2].
The metadata generated by human-reviewed ontology-driven textmining is useful in at least two ways to the biomedical community: first, it can significantly enrich the information content of journal articles when coupled to biomedical databases; second, it potentially enhances both the sensitivity and selectivity of search algorithms. The latter is of particular use in integrating across subspecialties or subdomains of interest in biomedical research.
This kind of integration is necessary to bring together information from multiple modalities of investigation in translational research on complex disorders such as Alzheimer’s disease, Parkinson’s disease, autism, melanoma, lung cancer and schizophrenia. These domains suffer simultaneously from too much, and too little, information. Creating computational intersections across research findings is a way to increase the signal-to-noise ratio within these research foci as well as to make them “semi-permeable” to relevant findings across foci.
The ontology-aligned metadata generated in human-supervised and curated textmining should also be able to increase the power of general-purpose web search algorithms within specific domains such as biomedical research. Biomedicine is rife with technical language, synonyms, and semantic linkages which are used in everyday biomedical language but which are non-obvious to the non-specialist. These should be leveraged to increase the power and specificity of web search in this domain, but at present they are not, except in highly focused engines used only by a few informatics-savvy specialists.
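As a rough illustration of how ontology-aligned synonyms could widen a search, the sketch below expands a query term to every label of the concept it names. The concept identifiers and the small label table are assumptions made up for this example; a real system would draw them from a curated ontology.

```python
# Toy concept table: hypothetical identifiers mapped to known labels.
CONCEPTS = {
    "concept:APP": ["APP", "amyloid precursor protein", "amyloid beta A4 protein"],
    "concept:MAPT": ["MAPT", "tau", "microtubule-associated protein tau"],
}


def build_index(concepts):
    """Map each lower-cased label to its concept identifier."""
    return {label.lower(): cid
            for cid, labels in concepts.items()
            for label in labels}


def expand(term, concepts=CONCEPTS):
    """Return all labels of the concept named by `term`.

    A term that matches no concept is returned unchanged, so the
    expansion never narrows the original query.
    """
    cid = build_index(concepts).get(term.lower())
    if cid is None:
        return [term]
    return sorted(concepts[cid])


expanded = expand("tau")
```

A search engine could then match any of the returned labels, so a query for "tau" would also retrieve passages that mention only "MAPT".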
The MIND Informatics group at Harvard/MGH has pioneered ontology-based methods for integrating the results of automated and semi-automated textmining [4, 5].
The Utopia project at The University of Manchester (http://www.utopiadocs.com) has developed methods for annotating and interacting with the content of PDFs. This enables articles published in what has previously been seen as a closed and static format to participate fully in the kind of open, semantically rich environment proposed here [6-8].
Numerous researchers have contacted us to express an interest in formalizing these methods, and extending and refining them for use in a wide array of applications. Textmining researchers in particular have been interested in our work.
We believe it is now possible to create an “information ecosystem” of properly specified metadata, shared among many producers, consumers and contributors. The fundamental requirements are that this metadata
· is separate from, but refers to, the base documents, just as documents are separate from, but refer to, each other;
· is anchored to common reference points in the semantic web infrastructure through common URIs;
· is technologically open (open specification);
· is fully provenanced, just as documents are provenanced;
· supports metadata layering, segmentation, and branching;
· supports ownership, authorization, and privacy restrictions; and
· has a wide collaborative uptake.
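The requirements above can be read as constraints on a data model. The following is a minimal sketch, assuming illustrative field names and example.org URIs that are not taken from the AO specification: each annotation has its own URI, points at a document URI and an ontology term URI, carries provenance, and belongs to a named layer with a coarse visibility setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    creator: str                      # URI of the person or tool that made it
    created: str                      # ISO 8601 timestamp
    curated_by: Optional[str] = None  # URI of a human reviewer, if any


@dataclass(frozen=True)
class Annotation:
    uri: str                    # the annotation's own URI, separate from the document
    target: str                 # URI of the annotated document
    exact: str                  # the annotated text span
    topic: str                  # URI of the ontology term the span is about
    provenance: Provenance
    layer: str = "default"      # named layer, for layering and segmentation
    visibility: str = "public"  # coarse access restriction (hypothetical scheme)


def visible_layer(annotations, layer, viewer_groups=frozenset({"public"})):
    """Return the annotations in one layer that the viewer may see."""
    return [a for a in annotations
            if a.layer == layer and a.visibility in viewer_groups]


doc = "http://example.org/papers/123"  # hypothetical document URI
store = [
    Annotation(
        uri="http://example.org/annotations/1",
        target=doc,
        exact="amyloid precursor protein",
        topic="http://example.org/ontology/APP",  # placeholder term URI
        provenance=Provenance("http://example.org/tools/textminer",
                              "2010-12-08T00:00:00Z"),
        layer="entities",
    ),
    Annotation(
        uri="http://example.org/annotations/2",
        target=doc,
        exact="we hypothesize that",
        topic="http://example.org/ontology/Hypothesis",  # placeholder term URI
        provenance=Provenance("http://example.org/people/curator-a",
                              "2010-12-08T00:00:00Z"),
        layer="rhetoric",
        visibility="lab-only",
    ),
]

public_entities = visible_layer(store, "entities")
```

Because the store holds only URIs and text spans, the documents themselves are never modified, and a consumer can select a single layer or a single provenance source without touching the rest.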
Perfecting the integration of these methods will give researchers the most complete and fully sharable information possible on key topics in biomedical research, in a fully integrated way. This sharable metadata ought to be available in common workspaces and in a common format.
As an initial step to advance this vision, we have agreed to align metadata models of the Utopia PDF annotation system and the SWAN Annotation Framework, using the “AO” Annotation Ontology as a common OWL/RDF representation.
We will demonstrate interoperability between these two annotation systems by generating AO metadata on a sample publication, storing it in a public repository, and selectively sharing it between the HTML and PDF representations of a biomedical publication. We believe this is a first step to being able to exchange a great variety of biomedical information artifacts freely as positional markup against PDF, HTML, and potentially, other document formats.
This initial demonstration will show how metadata in such a workspace might function, be accessed, and exchanged. The metadata in our demonstration will be exchanged as OWL/RDF using the Annotation Ontology (AO) (http://code.google.com/p/annotation-ontology/) model. AO provides fully-provenanced metadata on both curated biomedical textmining results, and manual annotation. AO is currently the subject of a community specification effort in the World Wide Web Consortium (W3C).
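One way positional markup can survive a move between the HTML and PDF renderings of the same paper is to anchor it by quoted text with surrounding context rather than by character offset. The sketch below assumes a prefix/exact/suffix style of selector and made-up sample texts; the function and field names are illustrative and are not quoted from the AO specification or from either tool.

```python
import re


def normalize(text):
    """Collapse whitespace runs so line breaks in extracted PDF text do not matter."""
    return re.sub(r"\s+", " ", text).strip()


def locate(text, prefix, exact, suffix):
    """Return (start, end) of `exact` within the normalized text, where it is
    immediately preceded by `prefix` and followed by `suffix`.
    Returns None when the selector does not match."""
    haystack = normalize(text)
    at = haystack.find(prefix + exact + suffix)
    if at < 0:
        return None
    start = at + len(prefix)
    return start, start + len(exact)


# The same sentence as it might be extracted from two renderings (sample data).
html_text = "Results. We show that mutations in APP cause early-onset disease."
pdf_text = ("J. Example 12 (2010) 34\n"
            "Results. We show that mutations in\nAPP cause early-onset disease.")

selector = {"prefix": "mutations in ", "exact": "APP", "suffix": " cause"}

html_span = locate(html_text, **selector)
pdf_span = locate(pdf_text, **selector)
```

The offsets differ between the two renderings, but both spans resolve to the same quoted text, which is what lets a single annotation object be displayed in either viewer.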
We intend to align this work with other related initiatives such as the Citation Ontology, SWAN-SIOC [5, 10-13], and the recently proposed nano-publication schema. Because AO is orthogonal to all domain-specific representations, it can work with a potentially unlimited number of domain ontologies. Therefore application of AO is not limited merely to domain representations chosen by us, but can be used by all interested parties.
We hope later to extend this proof of concept demonstration to a fully-fledged computational laboratory for biomedical textmining, again with common metadata specification, layering and exchange. This textmining laboratory will be initially seeded with full-text content from Elsevier and other publishers, which will then be annotated by several automated and manual annotation efforts.
As a corpus to work on for this computational laboratory, Elsevier and MGH/Harvard are planning to enable annotation on a collection of more than 3,000 Elsevier and other full-text papers via various external markup tools. This will allow the rapid and scalable testing of different annotation experiments, such as:
- Adding entity identifiers to specific biological and chemical entities in papers to create open linked metadata “clouds” around text within scientific papers;
- Adding ‘rhetorical’ identifiers at a clause or sentence level, that enable the creation of assertion-based annotation relevant to the ‘Hypothesis, Evidence and Relations’ (HypER) framework.
The focus of the computational laboratory will be to define and provide a standard web-services interface to allow these types of annotation to be generated by separate processes residing elsewhere on the web, and their results fed into a common process stream and metadata-annotation graphs. The project will entail developing standard RESTful web services that can accept input from various alternative approaches to (semi-)automated markup of claims and evidence in life science content, as proposed under the HypER framework. Next we hope to compare and contrast manually created claims with the outcome of automated assertion annotators, such as:
i. Tools developed at the University of Manchester for ‘Meta-Knowledge Annotation of Bio-Events’, which mark varying levels of author confidence (Certainty Level Markup) for specific biological statements;
ii. Tools developed at the Universities of Aberystwyth and Cambridge for identifying Argumentative Zones in biology and chemistry papers;
iii. Tools developed at Xerox Research Centre Europe to identify ‘core paradigm shifts’ in biology texts.
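A comparison between manually created claims and the output of an automated annotator could, for instance, be scored as set agreement over labelled sentences. The sketch below is one possible approach under stated assumptions: the sentence identifiers, labels and sample data are invented for illustration, and the manual set is treated as the reference.

```python
def agreement(manual, automated):
    """Compare two collections of (sentence_id, label) pairs.

    Returns precision, recall and F1 of the automated annotations,
    treating the manual annotations as the reference set.
    """
    manual, automated = set(manual), set(automated)
    hits = len(manual & automated)
    precision = hits / len(automated) if automated else 0.0
    recall = hits / len(manual) if manual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented sample data: three sentences labelled by a curator and by a tool.
manual_claims = {("s1", "hypothesis"), ("s2", "evidence"), ("s4", "hypothesis")}
tool_claims = {("s1", "hypothesis"), ("s2", "hypothesis"), ("s4", "hypothesis")}

scores = agreement(manual_claims, tool_claims)
```

Because every annotator would write into the same metadata graph with its own provenance, the same function could be applied to any pair of annotation sources without tool-specific code.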
The curated semantic metadata created in this laboratory will be made freely and openly available under a Creative Commons (http://creativecommons.org/) license.
This means that participating researchers, biomedical web communities such as Alzforum (http://www.alzforum.org) and PD Online (http://pdonlineresearch.org), and interested bioinformatics, search, and database professionals will all be able to freely access the metadata, which has an open specification.
A conceptual model of the workspace, its components and its relationships to other web collaboratories is shown in Figure 1.
Figure 1. Conceptual architecture of a shared-semantics textmining workspace.
We believe this kind of well-supported computational laboratory can not only attract many computational text mining participants, but can also produce extremely valuable metadata to support enhanced, biology-focused web search algorithms potentially synergistic with Google Scholar, PubMed Central and/or other bibliographic repositories.
1. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 2000, 25(1):25-29.
7. Pettifer S, Wolstencroft K, Alper P, Attwood T, Coletta A, Goble C, Li P, McDermott P, Marsh J, Oinn T et al: myGrid and UTOPIA: An Integrated Approach to Enacting and Visualising in Silico Experiments in the Life Sciences. In: Data Integration in the Life Sciences. Edited by Cohen-Boulakia S, Tannen V, vol. 4544: Springer Berlin / Heidelberg; 2007: 59-70.
13. Breslin JG, Bojars U, Passant A, Fernandez S, Decker S: SIOC: Content Exchange and Semantic Interoperability Between Social Networks. In: W3C Workshop on the Future of Social Networking. Barcelona, Spain; 2009.
15. de Waard A, Buckingham Shum S, Carusi A, Park J, Samwald M, Sándor Á: Hypotheses, Evidence and Relationships: The HypER Approach for Representing Scientific Knowledge Claims. In: Proceedings of SWASD 2009, co-located with ISWC-2009.
16. Nawaz R, Thompson P, McNaught J, Ananiadou S: Meta-Knowledge Annotation of Bio-Events. Submission to LREC 2010.
17. Liakata M, Teufel S, Siddharthan A, Batchelor C: Corpora for conceptualisation and zoning of scientific papers. Submission to LREC 2010.
18. Sándor Á, Vorndran A: Detecting Key Sentences for Automatic Assistance in Peer Reviewing Research Articles in Educational Sciences. In: Proc. Workshop Text and Citation Analysis for Scholarly Digital Libraries, Assoc. for Computational Linguistics; 2009: 36–44.