    Peter Murray-Rust: Quixote and GreenChainReaction

    I'd like to present two of our projects, Quixote and GreenChainReaction, both
    of which are aimed at creating semantically enriched data objects in
    physical science. (I think there are important and valuable technical
    differences between how physical scientists think about data and semantics
    and how - say - bio/medical scientists do.)

    Both are bottom-up projects in that they involve web-based contributors
    without an overarching coordinating body. They are open science (all the
    work is completely available on the Net as soon as it is published). They
    also build their semantics "bottom-up" - i.e. look to see what "discourse"
    is used in the domain and try to formalize this. There are probably about 30
    people involved (and there will be more by January 17th) so it doesn't make
    sense to give an author list - but the projects themselves will of course
    list contributors.

    These projects are disruptive technology in the same sense that Wikipedia or
    Wikileaks are disruptive. (Clay Shirky was lamenting on UK TV 2010-12-07
    that the reaction to WL was via extra-legal methods). I don't want to
    re-enter my polemics but it is factually correct that the established
    organizations in physical science (most publishers, most learned socs, some
    univs, some funders) are indifferent or antagonistic. If BTPDF ignores this
    then its results can only be cosmetic. I believe that it's factually true to
    say that text-mining is currently crippled by the lack of access to freely
    available and Open scientific content, and that this must be redressed. I
    have tried to engage with 3-4 major (closed) publishers of chemistry over 5
    years and the only thing I have achieved is a small corpus for testing
    purposes under CC-NC
    from one. One hasn't bothered to reply. Therefore chemistry will either
    remain a semantic desert or there will be a bottom-up revolution.

    So far I seem to be the only one addressing item 4 (IPR).

    On the more positive side we will succeed in our bottom-up projects to
    create semantics and ontologies for chemical objects and discourse. In
    GreenChainReaction we analysed ca 10,000 patents from the EPO and carried
    out semantically based text mining at a medium depth level (i.e. entity
    recognition, phrase recognition and default tree-banking). This showed that
    a deeper level of NLP gives much better precision than textual entity
    recognition alone (which is often too imprecise to be useful). We shall be
    re-running this exercise and presenting the results at BTPDF, where we shall
    be using USPTO patents to create about 200,000-500,000 reactions in complete
    semantic form. This will - we believe - have advantages over the current
    commercial extraction of chemistry into reaction databases. Unfortunately
    publishers forbid us to apply the technology to research articles and
    publish the results. So GCR builds up a resource of all objects published in
    chemical reactions and this should allow us to create a complete discourse
    ontology of reactions. (BTW anyone interested in text-mining will be welcome
    to take part).
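
    As a toy illustration of why phrase-level analysis can be more precise than
    bare entity recognition, here is a minimal Python sketch. It is not the GCR
    pipeline: the lexicon, the single hand-written pattern and the function
    names are all assumptions made up for this example.

```python
import re

# Hypothetical mini-lexicon; a real system would use a trained recognizer.
LEXICON = {"benzaldehyde", "ethanol", "sodium borohydride", "benzyl alcohol"}

def find_entities(sentence):
    """Entity level: every lexicon term that occurs, with no role attached."""
    lowered = sentence.lower()
    return sorted(term for term in LEXICON if term in lowered)

# Phrase level: one illustrative pattern for a single common phrasing,
# "<reactant> was added to <reagent> in <solvent>".
ADDITION = re.compile(
    r"(?P<reactant>[\w\s]+?) was added to "
    r"(?P<reagent>[\w\s]+?) in (?P<solvent>[\w\s]+?)[.,]",
    re.IGNORECASE,
)

def find_roles(sentence):
    """Return role -> chemical name, keeping only names found in the lexicon."""
    match = ADDITION.search(sentence)
    if match is None:
        return {}
    roles = {}
    for role, text in match.groupdict().items():
        name = text.strip().lower()
        if name in LEXICON:
            roles[role] = name
    return roles

reaction = "Benzaldehyde was added to sodium borohydride in ethanol."
aside = "Ethanol is a common solvent, unlike benzyl alcohol."

print(find_entities(reaction), find_roles(reaction))
print(find_entities(aside), find_roles(aside))
```

    The entity-level function reports chemicals in both sentences, although the
    second describes no reaction; the phrase-level function returns roles only
    for the first.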

    GCR is an after-the-fact markup although the technology could - in principle
    - be used in the authoring process. It's a question of communal will, not
    technology.

    Quixote represents semantics-at-source and marks up the output of
    computational chemistry calculations. It's common to publish "articles"
    which just describe calculations, though it's also common to find them as
    support for experimental work. Almost invariably the detailed results are
    never published though it's trivial to do so and the space is not a problem.

    The reason for this problem is purely cultural and commercial. Most
    calculations are carried out by closed-source for-money programs and there
    is an implicit policy of non-interoperability at the syntactic, semantic and
    ontological levels. The companies compete at least partially through lock-in
    and inertia, which means there is no incentive to create an ontology.

    Quixote believes that there *is* an underlying stable ontology and that by
    using the common programs, and exposing their results in semantic form
    (Chemical Markup Language) we will be able to create a core ontological
    abstraction. This is not as ambitious as it seems - the equations and
    fundamental physics are universal and have been stable for 80 years or more. By
    creating this ontology it will be possible to add annotation at the time
    data are emitted from the calculation. It means that all calculations (we
    guess about 100 million per year or more) will be available to the whole
    community as Open data. And again anyone can join in.
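
    To make "exposing results in semantic form" concrete, here is a minimal
    Python sketch that writes a geometry and one computed property as a
    CML-style document. It is an illustration under assumptions, not Quixote's
    converter: the function name, the `example:` dictionary and unit prefixes,
    and the numbers are made up, and only a handful of CML elements (molecule,
    atomArray, atom, property, scalar) are used.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
ET.register_namespace("", CML_NS)  # serialize CML as the default namespace

def to_cml(title, atoms, energy):
    """Build a CML-style <molecule> from (element, x, y, z) tuples and an energy."""
    molecule = ET.Element(f"{{{CML_NS}}}molecule", {"id": "m1", "title": title})
    atom_array = ET.SubElement(molecule, f"{{{CML_NS}}}atomArray")
    for index, (element, x, y, z) in enumerate(atoms, start=1):
        ET.SubElement(atom_array, f"{{{CML_NS}}}atom", {
            "id": f"a{index}",
            "elementType": element,
            "x3": f"{x:.4f}",
            "y3": f"{y:.4f}",
            "z3": f"{z:.4f}",
        })
    # dictRef ties the number to a dictionary entry; "example:" is a
    # placeholder prefix, not a real Quixote or CML dictionary.
    prop = ET.SubElement(molecule, f"{{{CML_NS}}}property",
                         {"dictRef": "example:totalEnergy"})
    scalar = ET.SubElement(prop, f"{{{CML_NS}}}scalar",
                           {"units": "example:hartree"})
    scalar.text = repr(energy)
    return ET.tostring(molecule, encoding="unicode")

# Illustrative values only, not the output of a real calculation.
water = [("O", 0.0, 0.0, 0.1173),
         ("H", 0.0, 0.7572, -0.4692),
         ("H", 0.0, -0.7572, -0.4692)]
cml_text = to_cml("water", water, -76.0267)
print(cml_text)
```

    The point of the dictRef attribute is that the value carries a pointer to
    its definition, so a reader (human or machine) does not have to guess what
    a bare number in a log file means.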

    These projects tick boxes 1.1, 1.2, 2.2, 2.3, 2.4. They also show in great
    detail two enthusiastic communities working on Use Cases (box 3).