ZAISA

The Zone Analysis In Scientific Articles (ZAISA) project ran from 2003 to 2006.

About

Text mining in scientific domains such as molecular biology is now recognized by scientists as an essential technique for enabling access to information contained in archived journal articles and abstract collections such as MEDLINE. Since major domain databases often contain incomplete and contradictory results over generations of experiments, scientists need to return to the source journal articles to confirm and complete templates of factual information. While the international research community has made major progress on bio-named entity extraction, i.e. the identification and classification of terminology, further progress aimed at pinpointing facts related to experimental results has proven difficult. Current technology does not suffice to identify the main result of an investigation among the large number of facts mentioned in each article, most of which relate to the work of others. In this project, we propose to create a critically needed knowledge source that will enable future research progress toward this goal: a publicly available training collection of journal articles annotated for rhetorical regions called zones.

Zones are, for example, the regions where the author states his/her specific aim, cites background publications, describes his/her experimental design, reports specific results of the experiments and their implications, or discusses the significance of his/her findings in comparison with previous work. Our notion of zone differs essentially from the core notion used in the widely known rhetorical structure theory (RST) (Mann and Thompson, 1987), namely the hierarchical organization of propositions (or clauses). We use a small set of classes with shallow nesting, in which the internal logical structure of clauses is not decisive. Rather, our approach emphasizes empirically determined class transitions together with weak linguistic clues used as feature information for machine learning. In the second year, we will use the annotated collection to classify zones automatically, using machine learning methods we are already familiar with, such as support vector machines and maximum entropy.

Example

The following illustrates our zone analysis of a passage taken from the Results section of a NAR journal article. We have removed the content words to highlight the influence of other phrases (numbering is added; “ref.” and “Fig.#” indicate a reference and a figure, respectively):

[1] Microarrays have been used to map replication in yeast (ref.). [2] We performed a similar experiment in Salmonella [3] to demonstrate that ……. [4] First, ……. [5] The resulting plot represents the relative increase in gene copy number …… (Fig.#). [6] A similar experiment was performed using …… (Fig.#). [7] The position of genes …… are scrambled ……. Yet, ……. [8] As ……, this observation will be worthy of further research. [9] Deviations from …… are more profound in Typhi CT18. [10] We speculate that …… [11] Nevertheless, this experiment shows that …… [12] Finally, many genes were induced in …… [13] The results also confirm the recent discovery of MntH …… (ref.)

Fig. 1 - Example of zones in a fragment of a NAR journal article

In the above, [1] provides background information related to the author’s experiment (BACKGROUND), [3] mentions the aim of the experiment (PROBLEM SETTING), whereas [2], [4], and [6] state the experiments performed (METHOD). [5], [7], and [9] provide the results of the corresponding experiments (RESULT). [8] assesses the observation just mentioned (IMPLICATION). [10] provides the author’s conjecture on the result (IMPLICATION), whereas [11] provides an insight or finding obtained from the result of the experiment mentioned in [6] (INSIGHT). [12] provides another result of an experiment (RESULT). [13] mentions an insight obtained by the author and relates the result to that of previous work (INSIGHT and CONNECTION). The annotation of the whole article (with a multi-colored representation) and the set of zone categories used are provided in Appendix 1.
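The analysis of Fig. 1 can be written down as a small data structure, which also makes the class transitions between consecutive segments easy to count. The following is only a minimal sketch: the names `FIG1_ZONES` and `zone_transitions` are hypothetical, segment [12] is taken to be RESULT because the text describes it as another result, and for segment [13] only the first-listed zone is used when counting transitions.

```python
from collections import Counter

# Zone labels for segments [1]-[13] of Fig. 1, following the description above.
# Assumption: [12] is labelled RESULT ("another result of an experiment").
FIG1_ZONES = {
    1: ["BACKGROUND"],
    2: ["METHOD"],
    3: ["PROBLEM SETTING"],
    4: ["METHOD"],
    5: ["RESULT"],
    6: ["METHOD"],
    7: ["RESULT"],
    8: ["IMPLICATION"],
    9: ["RESULT"],
    10: ["IMPLICATION"],
    11: ["INSIGHT"],
    12: ["RESULT"],
    13: ["INSIGHT", "CONNECTION"],  # overlapping zones
}


def zone_transitions(zones):
    """Count transitions between the first-listed zones of consecutive segments."""
    ordered = [zones[i][0] for i in sorted(zones)]
    return Counter(zip(ordered, ordered[1:]))


transitions = zone_transitions(FIG1_ZONES)
for (src, dst), count in sorted(transitions.items()):
    print(f"{src} -> {dst}: {count}")
```

Counts of this kind, collected over a whole annotated collection, are one simple way to make zone ordering available as a feature for a classifier.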

Linkage to Related Work

The work on zone analysis is based partly on earlier work by Dr. Simone Teufel (Cambridge, UK), who examined journals in the computational linguistics domain. Molecular biology texts, however, seem to be much more formally structured within the framework of experimental research than those used in Teufel’s study, and more focused on results. Our proposed research is unique in its focus on making sense of the myriad results (and other aspects of the work) in each article, as well as in its application to text mining and its proposed use of machine learning.

Pre-analysis of EMBO journal articles (published in LREC’2004) by the NII group has shown that, unlike content-based classification tasks in text mining such as the named entity task or event extraction, which use technical terms as their major features, the identification of zones rests largely on subtle clues such as common vocabulary words and phrases, section titles and the ordering of previous zone assignments. Given our hypothesis, based on the pre-analysis, that the task is feasible, we work on the ZAISA project in order to prove our ideas and make this a practical reality. In particular, we aim to annotate texts for the suggested zone classes and to prove that automatic classification is indeed possible using ‘weak’ linguistic clues.
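The kinds of weak clues named above (common words and phrases, section titles, and the previous zone assignment) can be turned into a sparse feature dictionary for a classifier. The sketch below is illustrative only: the cue phrases in `CUE_PHRASES` are hypothetical examples borrowed from the passage in Fig. 1, the function name `extract_features` is an assumption, and none of this is the project's actual feature set.

```python
# Hypothetical cue phrases for illustration; not the project's feature set.
CUE_PHRASES = [
    "we performed",
    "we speculate",
    "this experiment shows",
    "to demonstrate",
    "have been used",
]


def extract_features(sentence, section_title, previous_zone):
    """Build a sparse feature dict from weak, non-content clues."""
    text = sentence.lower()
    features = {
        "section=" + section_title.lower(): 1,  # section title clue
        "prev_zone=" + previous_zone: 1,        # previous zone assignment
    }
    for cue in CUE_PHRASES:                     # common-phrase clues
        if cue in text:
            features["cue=" + cue] = 1
    if "(ref.)" in text:                        # citation placeholder
        features["has_citation"] = 1
    if "(fig.#)" in text:                       # figure placeholder
        features["has_figure_ref"] = 1
    return features


features = extract_features("We speculate that ……", "Results", "RESULT")
print(features)
```

A dictionary of this form can be passed to most standard learners, for example support vector machines or maximum entropy models, after vectorization.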

Publications

  • Mizuta, Y., Korhonen, A., Mullen, T. and Collier, N. (2006), “Zone analysis in biology articles as a basis for information extraction”, International Journal of Medical Informatics, Elsevier, 75(6): 468-487.
  • Mullen, T., Mizuta, Y. and Collier, N. (2005), “A baseline feature set for learning rhetorical zones using full articles in the biomedical domain”, SIGKDD Explorations, ACM, 7(1): 52-58.
  • Mizuta, Y., Kawamoto, S., Mullen, T., Kawazoe, A. and Collier, N. (2005), “Creation of a dataset for zone analysis in biomedical texts: the design process and preliminary investigations”, Proc. 1st International Symposium on Languages in Biology and Medicine, Daejeon, Korea, November 24-26.
  • Mizuta, Y. and Collier, N. (2004), “An annotation scheme for rhetorical analysis of biology articles”, Proc. 4th International Conference on Language Resources and Evaluation (LREC'2004), Lisbon, Portugal, May 26-28.

Members

  • Yoko Mizuta
  • Tony Mullen
  • Anna Korhonen
  • Nigel Collier

Appendix 1

Sample illustration of zone analysis of a full-text NAR journal article

Zone categories used in the NAR journal example are given below:

Note. DIFFERENCE and CONNECTION overlap with some of the other categories.