Peter Murray-Rust: Quixote and GreenChainReaction

I'd like to present two of our projects, Quixote and GreenChainReaction, both
of which are aimed at creating semantically enriched data objects in
physical science. (I think there are important and valuable technical
differences between how physical scientists think about data and semantics
and how - say - bio/medical scientists do.)

Both are bottom-up projects in that they involve web-based contributors
without an overarching coordinating body. They are open science (all the
work is completely available on the Net as soon as it is published). They
also build their semantics "bottom-up" - i.e. look to see what "discourse"
is used in the domain and try to formalize this. There are probably about 30
people involved (and there will be more by January 17th) so it doesn't make
sense to give an author list - but the projects themselves will of course
list contributors.

These projects are disruptive technology in the same sense that Wikipedia or
Wikileaks are disruptive. (Clay Shirky was lamenting on UK TV 2010-12-07
that the reaction to WL was via extra-legal methods). I don't want to
re-enter my polemics but it is factually correct that the established
organizations in physical science (most publishers, most learned socs, some
univs, some funders) are indifferent or antagonistic. If BTPDF ignores this
then its results can only be cosmetic. I believe that it's factually true to
say that text-mining is currently crippled by the lack of access to freely
available and Open scientific content, and that this must be redressed. I
have tried to engage with 3-4 major (closed) publishers of chemistry over 5
years and the only thing I have achieved is a small corpus for testing
purposes under CC-NC
from one. One hasn't bothered to reply. Therefore chemistry will either
remain a semantic desert or there will be a bottom-up revolution.

So far I seem to be the only one addressing item 4 (IPR).

On the more positive side we will succeed in our bottom-up projects to
create semantics and ontologies for chemical objects and discourse. In
GreenChainReaction we analysed ca 10,000 patents from the EPO and carried
out semantically based text mining at a medium depth level (i.e. entity
recognition, phrase recognition and default tree-banking). This showed that
a deeper level of NLP gives much better precision than textual entity
recognition alone (which is often too imprecise to be useful). We shall be
re-running this exercise and presenting the results at BTPDF, where we shall
be using USPTO patents to create about 200,000-500,000 reactions in complete
semantic form. This will - we believe - have advantages over the current
commercial extraction of chemistry into reaction databases. Unfortunately
publishers forbid us to apply the technology to research articles and
publish the results. So GCR builds up a resource of all objects published in
chemical reactions and this should allow us to create a complete discourse
ontology of reactions. (BTW anyone interested in text-mining will be welcome
to take part).
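
To make the precision point concrete, here is a toy sketch in Python. It is not the GreenChainReaction pipeline: the three-term lexicon, the sample sentence and the function names `lexicon_hits` and `phrase_hits` are all invented for illustration, and a regular expression is only a crude stand-in for real phrase recognition. It contrasts word-level matching against a lexicon with matching a whole quantity phrase ("5 g of X").

```python
import re

# Hypothetical three-term lexicon standing in for a real chemical dictionary.
LEXICON = {"lead", "ethanol", "sodium borohydride"}


def lexicon_hits(text):
    """Entity-level matching: every lexicon term found as a whole word."""
    lowered = text.lower()
    return sorted(
        term
        for term in LEXICON
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


# Phrase-level matching: an amount, a unit, "of", then a lexicon term.
# Longer terms come first so multi-word names win over their substrings.
_NAMES = "|".join(
    re.escape(term) for term in sorted(LEXICON, key=len, reverse=True)
)
QUANTITY_PHRASE = re.compile(
    r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|g|ml|mmol|mol)\s+of\s+"
    r"(?P<name>" + _NAMES + r")\b",
    re.IGNORECASE,
)


def phrase_hits(text):
    """Return (name, amount, unit) for each quantity phrase in the text."""
    return [
        (m.group("name").lower(), float(m.group("amount")), m.group("unit").lower())
        for m in QUANTITY_PHRASE.finditer(text)
    ]


sentence = (
    "This may lead to side products. "
    "Then 5 g of sodium borohydride was added to 10 ml of ethanol."
)
print(lexicon_hits(sentence))
print(phrase_hits(sentence))
```

On this sentence the word-level matcher reports "lead" as a chemical although it is used as a verb, while the phrase-level matcher returns only the two substances that appear with amounts, and keeps the amounts and units attached to them.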

GCR is an after-the-fact markup although the technology could - in principle
- be used in the authoring process. It's a question of communal will, not

Quixote represents semantics-at-source and marks up the output of
computational chemistry calculations. It's common to publish "articles"
which just describe calculations, though it's also common to find them as
support for experimental work. Almost invariably the detailed results are
never published though it's trivial to do so and the space is not a problem.

The reason for this problem is purely cultural and commercial. Most
calculations are carried out by closed source for-money programs and there
is an implicit policy of non-interoperability at the syntax, semantic and
ontological level. The companies compete at least partially through lock-in
and inertia, which means there is no incentive to create an ontology.

Quixote believes that there *is* an underlying stable ontology and that by
using the common programs, and exposing their results in semantic form
(Chemical Markup Language) we will be able to create a core ontological
abstraction. This is not as ambitious as it seems - the equations and
fundamental physics are universal and have been stable for about 80 years or
more. By
creating this ontology it will be possible to add annotation at the time
data are emitted from the calculation. It means that all calculations (we
guess about 100 million per year or more) will be available to the whole
community as Open data. And again anyone can join in.
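
As a rough illustration of what "exposing results in semantic form" could look like, here is a minimal Python sketch that wraps a geometry and one computed quantity in CML-style XML. It is not Quixote's converter. The element and attribute names (`molecule`, `atomArray`, `atom`, `property`, `scalar`, `dictRef`) follow my understanding of CML; the function name `calculation_to_cml`, the `ex:` dictionary and unit references, and all the numbers are placeholders invented for the example, not real results. A real document would also declare the prefixes used in `dictRef` and `units`.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
ET.register_namespace("cml", CML_NS)


def _q(tag):
    """Qualify a tag name with the CML namespace."""
    return f"{{{CML_NS}}}{tag}"


def calculation_to_cml(mol_id, atoms, total_energy):
    """Serialize atoms (element, x, y, z) and one energy value as CML-style XML.

    The dictRef and units values are hypothetical placeholders; a real
    converter would point them at agreed dictionary entries.
    """
    molecule = ET.Element(_q("molecule"), {"id": mol_id})
    atom_array = ET.SubElement(molecule, _q("atomArray"))
    for index, (element, x, y, z) in enumerate(atoms, start=1):
        ET.SubElement(
            atom_array,
            _q("atom"),
            {
                "id": f"a{index}",
                "elementType": element,
                "x3": f"{x:.4f}",
                "y3": f"{y:.4f}",
                "z3": f"{z:.4f}",
            },
        )
    prop = ET.SubElement(molecule, _q("property"), {"dictRef": "ex:totalEnergy"})
    scalar = ET.SubElement(
        prop, _q("scalar"), {"dataType": "xsd:double", "units": "ex:hartree"}
    )
    scalar.text = repr(total_energy)
    return ET.tostring(molecule, encoding="unicode")


# Placeholder geometry and energy, not the output of any real calculation.
water = [("O", 0.0, 0.0, 0.0), ("H", 0.76, 0.59, 0.0), ("H", -0.76, 0.59, 0.0)]
cml_text = calculation_to_cml("m1", water, -76.0)
print(cml_text)
```

The point of the `dictRef` attribute is that the number carries a pointer to a shared definition of what it means, so the annotation is attached at the moment the data are emitted rather than reconstructed later.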

These projects tick boxes 1.1, 1.2, 2.2, 2.3, 2.4. They also show in great
detail two enthusiastic communities working on Use Cases (box 3).