Links to blogs and models, partly coming from the discussion forum:

Annotation links:

New formats for publishing science:

Modular formats for Science Publishing:
Basically, these proposals move to a more granular level than the full scientific paper: the 'smallest publishable unit' becomes smaller than a complete paper.
  • Modular Physics Paper:  University of Amsterdam (1999).
    A modular format for physics papers: by investigating a collection of papers, a more fine-grained structure for science papers and an extensive taxonomy of relationships are proposed.
  • LiquidPub, EU Project, U Trento and others (2008-2011)
    A 'liquid' format for science papers is proposed, consisting of a set of research objects connected by links.
  • 'Coarse-grained rhetorical structure', work done in the HCLS SiG of the W3C, 2009 - now.
    This group aims to define a 'rhetorical structure' for scientific papers, for use in authoring and mark-up tools. It has an interim proposal of its own and is beginning to compile an overview of existing publishers' proposals.
  • The abcde format - Utrecht University, 2007
    The abcde format is a proposal for a simple, structured, LaTeX-based format for conference papers in computer science. Each paper consists of three sections (Background, Contribution, and Discussion) plus three added elements: A = Annotation, i.e. Dublin Core metadata; E = Entities, i.e. RDF-formatted entities of interest, including references; and, contributing nothing to the acronym, Core Sentences, which the author marks up as core elements and which can be extracted to form a structured abstract.
  • Nanopublications, NBIC (the Netherlands Bioinformatics Centre)
    A 'nanopublication' is, at its core, a scientific assertion written in semantic-web standard formats with additional metadata concerning provenance.
Modeling Science (Especially Biology) as Triples:
The main idea is that science should be represented as a set of triples. There is a special interest in this representation within biology and the life sciences. Some initiatives include (a minimal triple sketch follows this list):
  • The Structured Digital Abstract, Seringhaus/Gerstein, 2008
    This paper proposes including a 'structured XML-readable summary of pertinent facts' in each paper.
  • FEBS Letters SDA, 2008 - now
    The journal FEBS Letters adds curator-created triples on Protein-Protein interaction to every appropriate paper
  • CWA Nanopublications - 2010
    The Concept Web Alliance proposes to model scientific research as sets of triples; the first definition of the format, 'The Anatomy of a Nanopublication', has just been published.
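As a rough illustration of this triple-based view, the sketch below encodes a single protein-protein-interaction assertion plus minimal provenance as RDF triples using the Python rdflib library. It is not the Concept Web Alliance's nanopublication schema; the namespace, the interactsWith property, and all identifiers are invented placeholders.

```python
# Minimal sketch of the "science as triples" idea using rdflib.
# The example.org namespace, the interactsWith property, and all
# identifiers are placeholders, not an official nanopublication schema.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/")

g = Graph()
g.bind("ex", EX)
g.bind("dcterms", DCTERMS)

# The core assertion: protein A interacts with protein B.
g.add((EX.ProteinA, EX.interactsWith, EX.ProteinB))

# Minimal provenance-style metadata, attached to an identifier for the assertion.
assertion = URIRef("http://example.org/assertion/1")
g.add((assertion, DCTERMS.creator, Literal("A. Curator")))
g.add((assertion, DCTERMS.source, URIRef("http://example.org/paper/123")))

print(g.serialize(format="turtle"))
```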
Hypothesis/Claim-Based representation of the Argument Structure of a Scientific Paper:
These projects all start from the assumption that a scientific paper is, at heart, a persuasive text that makes a number of claims backed by research data and references. The paper is then represented as a set of hypotheses linked to evidence (in the form of data or references); a toy sketch of this view follows the list.
  • Cohere, KMI, 2007- now
    The Cohere project, which builds on the earlier 'ClaiMaker' project, offers a web-based interface to create claims, hypotheses, or statements and relate them to other claims using an open set of relationships. It is usable for science, but also for structuring online debates on other topics.
  • SWAN, Alzheimer's Network, Harvard IIC, 2006 - now:
    The SWAN project adds a collection of hand-curated hypotheses to a research paper, which are then related through a set of discourse relationships. These can be browsed, and relations between claims, as well as support networks for a specific claim, are constructed and visualised.
  • SALT, DERI, 2008
    SALT is a LaTeX-based authoring tool that allows authors to identify Rhetorical Structure Theory (RST) relations between sentences in their paper. It offers the author the opportunity to define main and secondary (satellite) sentences and create relations between them.
  • aTags, DERI, 2009- now
    aTags ("associative tags") are snippets of HTML that capture the information that is most important to you in a machine-readable, interlinked format. aTags work with any Web text and can store and connect any textual element highlighted in a browser.
  • Hypotheses in Biology, UvA, 2009
    A methodology and set of proto-ontologies in OWL for capturing different aspects of a text mining experiment: the biological hypothesis, text and documents, text mining, and workflow provenance.
  • HyBrow, Stanford, 2008
    A prototype bioinformatics tool for designing hypotheses and evaluating them for consistency with existing knowledge
  • HypER: 2009 - now
    HypER is an ad hoc group of researchers who all represent scientific communications as sets of hypotheses with relations to evidence. It includes representatives of the LiquidPub, Cohere, SWAN, SALT, aTags, and abcde work. The main focus of HypER has shifted to the W3C HCLS work on scientific discourse structures.
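As a loose illustration of this claim-and-evidence view (not the actual data model of any of the projects above), the toy Python sketch below reduces a paper to claims linked to evidence and to other claims; every class, field, and relation name is invented for the example.

```python
# Toy sketch of a claim-plus-evidence representation of a paper.
# Class, field, and relation names are invented and do not follow the
# Cohere, SWAN, SALT, or HypER data models.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    kind: str        # e.g. "dataset", "citation", "experiment"
    reference: str   # DOI, URL, or dataset identifier


@dataclass
class Relation:
    kind: str        # e.g. "supports", "contradicts", "extends"
    target: "Claim"  # the claim this relation points to


@dataclass
class Claim:
    text: str
    evidence: List[Evidence] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)


# A paper reduced to a single claim backed by one piece of evidence.
main_claim = Claim(
    text="Protein A interacts with protein B in vivo.",
    evidence=[Evidence(kind="experiment", reference="doi:10.0000/example")],
)
```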
Semantic Publishing Initiatives and enriched forms of publication:
  • The Semantic Biochemical Journal - 2010:
    Uses Utopia, an innovative PDF reader, to enrich the PDF with interactive figures and active data.
  • Article of the Future, Cell, 2009:
    Tabbed and hyperlinked presentation of the article; Graphical Abstract and Highlights on the landing page
  • Prospect, Royal Society of Chemistry, 2009:
    RSC editors annotate compounds, concepts, and data within the articles and link these to additional electronic resources such as biological databases.
  • Adventures in Semantic Publishing, Oxford U, 2009:
    A hand-marked-up version of a paper in epidemiology, with data enhancements and improved browsing and reference linking.
  • Open Access journals published by Pensoft come with semantic enhancements. Example: PhytoKeys.
Authoring Tools that support semantic applications

Rhetorical relations between scientific papers:
Various efforts to create relationship taxonomies/links between citing/cited papers:
  • SWAN/SIOC/CiTO alignment, 2010, HCLS SiG of W3C:
    This is an effort to align two citation relationship ontologies: SWAN (used in the SWAN project) and SIOC.
    Two other ontologies are being considered: CiTO and relations to the publisher-centric PRISM standard
  • Modelling literature citations as RDF (2010 onward)
    A public RDF triplestore of biomedical literature citations, encoded as Linked Open Data and characterized using CiTO, the Citation Typing Ontology (a small sketch of such typed citations follows this list).
  • ClaiMaker: the precursor of the Cohere work; its relationship ontology is available in RDF
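To make typed citation links concrete, here is a small rdflib sketch in the spirit of CiTO. The article URIs are placeholders, and the property names used (cito:cites, cito:extends) are my reading of the ontology rather than a verified excerpt.

```python
# Sketch of typed citation links in RDF, in the spirit of CiTO
# (the Citation Typing Ontology). Article URIs are placeholders and the
# cito: property names should be checked against the published ontology.
from rdflib import Graph, Namespace, URIRef

CITO = Namespace("http://purl.org/spar/cito/")

g = Graph()
g.bind("cito", CITO)

citing = URIRef("http://example.org/article/A")
cited = URIRef("http://example.org/article/B")

g.add((citing, CITO.cites, cited))    # a plain, untyped citation
g.add((citing, CITO.extends, cited))  # a more specific rhetorical relation

print(g.serialize(format="turtle"))
```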

Structuring Methods and Experiments:
Several initiatives aim to structure Methods and Experiments to include in papers. Some of these include:
  • myExperiment:
    a platform to create and exchange experimental workflow components
  • Knowledge Engineering from Experimental Design (KEfED)
    A structured way of constructing 'observational assertions' based on statistical relationships from experiments. The model is general-purpose and forms a basis for reasoning over experimental data. 
  • Ontology for Biomedical Investigations (OBI)
    A broad-based community effort to develop an ontology that provides a representation for biomedical experiments.
  • Investigation/Study/Assay (ISA) infrastructure: a general-purpose format and freely available desktop software suite, targeted at curators and experimentalists, that assists in the management of experimental metadata, engages with minimum-information checklists and ontologies, and formats studies for submission to international public repositories (ENA for genomics, PRIDE for proteomics, ArrayExpress for transcriptomics). A toy sketch of the investigation/study/assay nesting follows this list.

  • VisTrails: an open-source data analysis and visualization tool that supports the creation of documents whose results have deep captions that point to their provenance, and thus can be reproduced and verified. Provenance-rich results derived by VisTrails can be included in LaTeX, Wiki, Microsoft Word, and PowerPoint documents.
  • crowdLabs: a platform for sharing and executing computational tasks
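As a toy illustration of the investigation/study/assay nesting mentioned in the ISA entry above (emphatically not the ISA-Tab file format), here is a minimal Python sketch with invented field names.

```python
# Toy sketch of investigation/study/assay nesting for experimental
# metadata, loosely in the spirit of the ISA framework. This is not the
# ISA-Tab format; all field names are invented for illustration.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    measurement_type: str   # e.g. "transcription profiling"
    technology: str         # e.g. "DNA microarray"
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    title: str
    organism: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    identifier: str
    studies: List[Study] = field(default_factory=list)


inv = Investigation(
    identifier="INV-0001",
    studies=[
        Study(
            title="Example study",
            organism="Homo sapiens",
            assays=[Assay("transcription profiling", "DNA microarray", ["raw_data_1.txt"])],
        )
    ],
)
```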
Computational Linguistics/Text Mining Efforts:
There is a large amount of work in this area, a few projects include:
  • Hypothesis identification at Xerox:
    Attempts, using the Xerox Integrated Parser, to find key statements in biology research papers.
  • Argumentative Zoning, work by Simone Teufel and others
  • In-Context Summaries, Macquarie University in Australia:
    Providing a summary of referred-to materials, weighted by the referring sentence
  • Metaknowledge annotation of biomedical events, NaCTeM, University of Manchester
    Annotation of interpretative information for biomedical events along 5 different dimensions: Knowledge Type (fact, analysis, observation, etc.), Certainty Level, Polarity, Manner, and Source (a toy record with these dimensions follows this list).
  • Automatic recognition of sentence types in biomedical abstracts (Tsujii lab, University of Tokyo) - title, conclusion, method, objective, result - see MEDIE (advanced search) for a demo
  • GENIA (Tsujii Lab, University of Tokyo) and GREC (NaCTeM, University of Manchester) - corpora annotated with biomedical events - allow systems to be trained to automatically identify and structure relevant information in biomedical documents.
  • AcroMine (NaCTeM, University of Manchester) - automatically determines the full forms of acronyms
  • Linking biomedical named entities in documents to related database entries - such links are provided in the BioLexicon. Examples of search engines providing such links are MEDIE and UKPMC.
  • U-Compare - An integrated text mining/natural language processing system based on the UIMA Framework, allowing documents to be processed by various text mining tools.
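As a toy illustration of the five metaknowledge dimensions listed above, here is a small Python record; the field names and example values are invented and do not reflect NaCTeM's actual annotation scheme or serialization.

```python
# Illustrative record for the five metaknowledge dimensions described
# above (Knowledge Type, Certainty Level, Polarity, Manner, Source).
# Field names and values are invented; this is not NaCTeM's schema.
from dataclasses import dataclass


@dataclass
class EventMetaknowledge:
    event_text: str       # the biomedical event being annotated
    knowledge_type: str   # e.g. "fact", "analysis", "observation"
    certainty_level: str  # e.g. "certain", "probable", "speculative"
    polarity: str         # "positive" or "negative"
    manner: str           # e.g. "high", "low", "neutral"
    source: str           # e.g. "current study", "cited work"


example = EventMetaknowledge(
    event_text="Protein A activates gene B",
    knowledge_type="observation",
    certainty_level="speculative",
    polarity="positive",
    manner="neutral",
    source="current study",
)
```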

Author Identification:

Provenance:
    A key part of science is knowing the provenance of a paper, experiment, data item, etc. (see reproducible research below). Provenance includes attribution, sources, experimental workflow, citations, and quotes, i.e. who, what, when, where, and why.
  • W3C incubator group on provenance - its mission was to provide a state-of-the-art understanding and develop a roadmap in the area of provenance for Semantic Web technologies, development, and possible standardization. Finishes Dec. 2010.
  • Open Provenance Model - a model for the interoperable exchange of provenance information arising out of a series of Provenance Challenges focusing on understanding the compatibility and interchange of information between provenance systems
  • A comprehensive review of provenance research: Moreau, L. (2010) The Foundations for Provenance on the Web. Foundations and Trends in Web Science, 2(2-3), pp. 99-241. ISSN 1555-077X.
Reward Systems:
  • Ten Simple Rules for Getting Ahead as a Computational Biologist in Academia (see attached file).
  • Describes the role of new measures of scientific impact based on web activity
Data citation:

What PDFs do right:
File and format
  • self-contained in a small single file, and thus easy to manage
  • readable on every platform that I use
  • no one will ever revoke my right to read them (even if I change institution, job etc)
  • well-typeset
  • typically devoid of clutter
  • Article Of Record
    • lots of independent copies--tamper-resistant
  • ability to search entire file (Ctrl-F)
  • can see entire thing with minimal clicking between sections
  • "they are single encapsulated files which you give you security of access, on all platforms in all circumstances, and are easy to manage"
  • "they make for easy/attractive reading, presenting a narrative in a digestible way."

  • linear narrative, author's argument and thought processes
What PDFs do wrong:
  • sized for printing, not laptop-screen (let alone hand-held)
  • challenging to annotate on-screen with most tools
  • not granular -- for citing and reading
Hope for PDFs:
Challenges and opportunities for PDFs/communicating paper contents themselves:
  • Reproducibility
    • Data and methods sections need to be digital for reproducibility
    • need a system to capture the analysis automatically and then
      an easy way to embed it in the manuscript itself (a minimal sketch of this idea follows this list)

    • Reproducible Research System (RRS) with two components (per 'Accessible Reproducible Research', Jill P. Mesirov, Science 327:415, 2010)
      • an environment for doing the work (Reproducible Research Environment, or RRE)
      • an authoring environment (Reproducible Research Publisher, or RRP)
        that provides an easy link to the RRE
  • Living documents, that are auto-updated with new information
  • "flexible layout, great typography, great reusability, pick two."
  • "Reading on the web starts to fall apart the moment you get anything longer than 1 or 2 pages."
  • argument-based indexing (smarter discovery by capturing assertions, etc.)
  • "Embedded XML and smart PDF reading software could potentially allow consumers to have some control over display (providing reflowable versions for different-sized screens) while keeping the printable producer-endorsed view."
Challenges and opportunities beyond the PDF itself, for science/IR/etc in general:
  • Tracking reuse
  • Motivation for sharing data
  • 'Domain Specific Reasoning Models' - provide a mechanism whereby one can reason over a model and generate hypotheses that can be tested experimentally
  • context-aware thesaurus substitution

Ways of using papers:
  • Honed linear narrative (story that persuades with data): As a means of understanding some new concept. Assuming I've selected a particular article as being appropriate, I want well crafted, honed linear narrative that I can read on the bus. I want it to have been written by the world's expert on Subject X, and to have an idea explained to me clearly. I would expect to read it in totality, from top to bottom. At this stage the data is important to back up the narrative, but I'm likely to take whatever subset of data has been presented (somewhat) at face value, and am unlikely explore it right away. I think Anita's description of an article as a 'Story that persuades with data' hits this nail perfectly on its head.
  • Reference work: As a means of finding evidence for an idea of my own. Here I'm treating an article much more like a reference work: does it tell me that A interacts with B or that X is a kind of Y, and does the data that's associated with the article really back this up (or provide me with a way of drawing my own conclusions). In this phase I'm treating an article much more like a database entry, or as a mixed bag of facts, the validity of which I want to establish as painlessly as possible. I also want to be able to ask of the literature in general 'is there any article that claims that X is a kind of Y'.
  • Publication venue as a proxy for quality, used for promotion/tenure/etc ("quantifying of impact albeit implicitly by 'sorting' publications into journals that have pre-defined windows of impact")
  • Annotation connecting various ideas or papers?
  • ...