PIA

The Portable Information Access project (PIA) ran from 2000 to 2004.

About

The Portable Information Access (PIA) project aims to develop a domain-adaptable information extraction (IE) system for annotating semantic content in texts. Other Web-based technologies such as information retrieval (IR) are characterized by strong portability across domains, but no comparably portable system yet exists for IE.

The Semantic Web

The Semantic Web as an entity is likely to have considerable social and economic impact. Tim Berners-Lee, the inventor of the World Wide Web (Web), has termed it the second generation Web (Berners-Lee, 1999) and many of those in the Web community expect it to have at least as great an impact on the lives of ordinary users as the first generation Web based on HTML that is familiar to us today. Basically the Semantic Web is being designed so that computer programs can 'understand' the meaning of information in Web resources such as documents and carry out sophisticated tasks for users. The Semantic Web will be an extension of the Web that we all use today.

The 'smart' applications that we hope to see emerge on the Semantic Web include smart browsers, question answering systems, automatic data formatting for different devices in wireless mobile ubiquitous data networks, and support for electronic shopping and appointments through agent software. All of these applications incorporate intelligent problem solving services based on knowledge provided within the Semantic Web framework. The Semantic Web is likely to be based on ontologies and metadata. Formally an ontology may be considered to be "a specification of a conceptualization" (Gruber, 1993) and is used mainly for knowledge sharing and re-use. Basically ontologies will describe the data that users wish to exchange.

In PIA we are concerned with domain ontologies that are created by domain experts and can be used to define a set of concepts and their relations for a specific group of users (domain-users) who share a common conceptualization for a given domain. The ontology should also define a set of axioms that can be applied to the concepts. The ontology needs to be implemented in a practical way, i.e. as an engineering artefact, for access by other software components in PIA. The RDF framework, its proposed extensions such as DAML+OIL (Hendler, 2000; VanHarmelen, 2000), and the software tools currently being built to create annotations that conform to them provide the support framework within which Semantic Web applications operate.
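As a rough illustration of what concepts, relations and axioms look like as an engineering artefact, the sketch below stores a toy ontology as RDF-style (subject, predicate, object) triples and applies one axiom, the transitivity of rdfs:subClassOf. The ex: concept names and the superclasses helper are hypothetical and are not taken from any PIA ontology; only the rdf: and rdfs: terms are standard RDF Schema vocabulary.

```python
# Toy ontology as RDF-style triples. The "ex:" names are invented for
# illustration; the "rdf:"/"rdfs:" terms are standard RDF Schema vocabulary.
TRIPLES = {
    ("ex:Protein", "rdf:type", "rdfs:Class"),
    ("ex:Enzyme", "rdfs:subClassOf", "ex:Protein"),
    ("ex:Kinase", "rdfs:subClassOf", "ex:Enzyme"),
    ("ex:activates", "rdfs:domain", "ex:Protein"),
}


def superclasses(concept, triples):
    """Return every ancestor of `concept`, following rdfs:subClassOf
    transitively (the axiom being applied here)."""
    found = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if (subject == current
                    and predicate == "rdfs:subClassOf"
                    and obj not in found):
                found.add(obj)
                frontier.append(obj)
    return found


kinase_ancestors = superclasses("ex:Kinase", TRIPLES)
print(sorted(kinase_ancestors))
```

A software component that needs to know whether an instance of one concept also counts as an instance of a broader concept could query a structure of this kind instead of hard-coding the hierarchy.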

Role of PIA

We consider that the high cost of semantic annotation and its required expertise are potential bottlenecks in the spread of the Semantic Web. PIA is therefore concerned with machine learning for text-to-knowledge conversion so that computer programs can learn how to annotate new Web-based texts based on a relatively small number of examples of annotated texts in the domain.

PIA has its foundations built on three resources which are currently being developed:

    • Open Ontology Forge - an integrated ontology editor and annotation tool;
    • PAM - an annotation management system which will be a client-server system for domain experts and users to easily and securely share texts, annotations and ontologies;
    • PIA-Core - a set of domain adaptable machine learning tools so that the server-side can semi-automatically annotate users' Web pages with semantic content based on already submitted examples.

The working realization of these resources will be a system called "Ontology Forge."

Each of these resources is described below in relation to the three main knowledge sources: texts, annotations and ontologies.

Open Ontology Forge (OOF)

OOF has been designed for ontology definition and capturing instances of concepts in texts.

The tool aims to aid annotators who are experts in their domain in marking up so-called named entity expressions such as technical terms that we want to distinguish from the rest of the text. The tool also provides for annotation of coreference expressions. We provide output of instances in both RDF and in-line XML-style formats for training named entity recognizers.
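To make the in-line XML-style output concrete, here is a minimal sketch of how character-offset annotations could be rendered as in-line markup for training a named entity recognizer. The function name to_inline_xml, the tag names PROTEIN and DNA, and the example sentence are assumptions made for illustration; the actual OOF output schema is not specified here.

```python
from xml.sax.saxutils import escape


def to_inline_xml(text, spans):
    """Wrap each (start, end, concept) character span of `text` in an
    XML-style element named after its concept.

    Spans are assumed to be non-overlapping; this sketch rejects overlaps
    rather than trying to nest them.
    """
    pieces = []
    cursor = 0
    for start, end, concept in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(escape(text[cursor:start]))   # text before the entity
        pieces.append(f"<{concept}>{escape(text[start:end])}</{concept}>")
        cursor = end
    pieces.append(escape(text[cursor:]))            # trailing text
    return "".join(pieces)


# Hypothetical annotated sentence: offsets and concept labels are invented.
sentence = "NF-kappa B activates the IL-2 gene."
annotations = [(0, 10, "PROTEIN"), (25, 29, "DNA")]
marked_up = to_inline_xml(sentence, annotations)
print(marked_up)
```

Keeping the annotations as stand-off offsets and generating the markup on demand means the same annotations could also be serialized to RDF without touching the source text.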

PIA-Core

We believe that machine learning from examples is worth exploring as a way to reliably replicate the capabilities of experts and this is the goal of PIA-Core. Semantic annotation of texts necessarily involves automatically identifying and classifying technical terminology and proper nouns, finding functional values and finding instances of axiomatic relations. In PIA-Core we are approaching this using supervised machine learning from annotated texts combined with a user-provided ontology (domain model). We are currently investigating a range of classifier models such as Support Vector Machines (SVMs), Maximum Entropy and Decision Trees (C5) that combine the knowledge available in the ontology with linguistically motivated features available from robust natural language processing tools such as a part of speech tagger and a shallow parser.
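The combination of ontology knowledge with linguistic features can be pictured as a per-token feature vector that is handed to whichever classifier is being tested. The sketch below is an assumption about what such features might look like, not the PIA-Core feature set: the function token_features, the feature names, the Penn Treebank-style tags and the one-entry ontology lookup are all hypothetical.

```python
def token_features(tokens, pos_tags, ontology_terms, i):
    """Build a feature dictionary for the token at position i.

    tokens         -- list of word strings for one sentence
    pos_tags       -- part-of-speech tags aligned with `tokens`
    ontology_terms -- dict mapping lower-cased terms to an ontology concept
    """
    word = tokens[i]
    return {
        # Surface and orthographic features
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        # Linguistic features from a part of speech tagger
        "pos": pos_tags[i],
        "prev_pos": pos_tags[i - 1] if i > 0 else "<s>",
        "next_pos": pos_tags[i + 1] if i + 1 < len(tokens) else "</s>",
        # Knowledge from the user-provided ontology
        "ontology_concept": ontology_terms.get(word.lower(), "NONE"),
    }


# Hypothetical sentence, tags and ontology lookup.
tokens = ["IL-2", "activates", "T", "cells"]
pos_tags = ["NN", "VBZ", "NN", "NNS"]
ontology_terms = {"il-2": "PROTEIN"}

first = token_features(tokens, pos_tags, ontology_terms, 0)
last = token_features(tokens, pos_tags, ontology_terms, 3)
print(first)
```

Because the features are plain dictionaries, the same representation could in principle be fed to an SVM, a Maximum Entropy model or a decision tree learner after a suitable encoding step, which makes it easier to compare the models on equal terms.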

PIA Annotation Management (PAM) System

In order to facilitate collaborative development of ontologies as shared domain conceptualizations between experts and to encourage the sharing of annotations we are developing the PIA Annotation Management System. The basic user model allows for three types of user privilege depending on the level of the user's expertise. The domain manager takes overall responsibility for project management, registering members and version control including releasing a public version of the ontology to the Web; the domain expert takes responsibility for forming the ontology itself, while the domain user takes public ontologies and annotates documents according to the concepts, properties and relations given there. The public ontologies are available on the Ontology Forge server for users and software agents to view and refer to.
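The three-level user model can be sketched as a simple mapping from roles to permitted actions. The action names and the is_allowed helper below are hypothetical, and the sketch assigns each role only the responsibilities named in the text; whether higher roles also inherit the privileges of lower ones in PAM is not stated, so no inheritance is assumed.

```python
from enum import Enum


class Role(Enum):
    DOMAIN_MANAGER = "domain manager"
    DOMAIN_EXPERT = "domain expert"
    DOMAIN_USER = "domain user"


# Hypothetical action names, one set per role, following the
# responsibilities described in the text. No inheritance is assumed.
PERMISSIONS = {
    Role.DOMAIN_MANAGER: {
        "register_member",
        "manage_versions",
        "release_public_ontology",
    },
    Role.DOMAIN_EXPERT: {"edit_ontology"},
    Role.DOMAIN_USER: {"read_public_ontology", "annotate_document"},
}


def is_allowed(role, action):
    """Return True if `role` may perform `action` under this sketch."""
    return action in PERMISSIONS.get(role, set())


print(is_allowed(Role.DOMAIN_EXPERT, "edit_ontology"))
print(is_allowed(Role.DOMAIN_USER, "edit_ontology"))
```

Centralizing the rules in one table keeps the server-side check to a single lookup, and the table can be extended if the privilege levels turn out to be cumulative.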

Publications

  • Collier, N. and Takeuchi, K. (2004), “Comparison of character-level and part of speech features for name recognition in bio-medical texts”, Journal of Biomedical Informatics, 37(6): 423-435, Elsevier. [pubmed][bio1 named entity data]
  • Kawazoe, A. and Collier, N. (2003), "Open Ontology Forge: a tool for ontology creation and text annotation in a biomedical domain", Proc. 14th Conference on Genome Informatics, Yokohama, Japan, December 14-17, pp. 677-678.
  • Kawazoe, A. and Collier, N. (2003), "Open Ontology Forge: application of a tool for ontology creation and text annotation to cultural heritage information", Proc. Nara Symposium for Digital Silk Roads, Nara, Japan, December 10th, pp. 395-401.
  • Takeuchi, K. and Collier, N. (2003), “Bio-medical entity extraction using a support vector machine”, in Proc. Workshop on Natural Language Processing in Biomedicine (BioNLP) at ACL’2003, Sapporo, Japan, July 11th, pp. 57-64.
  • Collier, N., Takeuchi, K. and Kawazoe, A. (2003), “Open Ontology Forge: an environment for text mining in a Semantic Web world”, Proc. International Workshop on Semantic Web Foundations and Application Technologies, Nara, Japan, March 11th, pp. 17-24.
  • Kawazoe, A. and Collier, N. (2003), “An Ontologically-motivated annotation scheme for coreference”, Proc. International Workshop on Semantic Web Foundations and Application Technologies, Nara, Japan, March 11th, pp. 85-88.
  • Takeuchi, K. and Collier, N. (2002), “Use of support vector machines in extended named entity recognition”, Proc. 6th Conference on Natural Language Learning (CoNLL-2002), Taipei, Taiwan, August 31st – September 1st, pp. 119-125.[pdf][bio1 named entity data]
  • Collier, N., Takeuchi, K., Nobata, C., Fukumoto, J. and Ogata, N. (2002), “Progress on multi-lingual named entity annotation guidelines using RDF(S)”, Proc. 3rd International Conference on Language Resources and Evaluation, Las Palmas, Spain, May 29th – 31st, pp. 2074-2081.
  • Collier, N. and Takeuchi, K. (2002), “PIA-Core: Semantic annotation through example-based learning”, Proc. 3rd International Conference on Language Resources and Evaluation, Las Palmas, Spain, May 29th – 31st, pp. 1611-1614. [pdf]
  • Collier, N., Takeuchi, K. and Tsuji, K. (2001), “The PIA Project: learning to semantically annotate texts from an ontology and XML-instance data”, in position paper proc. 1st Semantic Web Working Symposium (SWWS’2001), Stanford University, California, USA, July 30th – August 1st, pp. 8-9.
  • Collier, N. (2001), “Machine learning for information extraction from XML marked-up text on the Semantic Web”, Proc. Semantic Web Workshop at the Tenth International Conference on the World Wide Web (WWW’10), Hong Kong, May 1-5, pp. 29-36. [pdf]

Members

    • Ai Kawazoe (NII, now at Tsuda College)
    • Tony Mullen (NII, now at Tsuda College)
    • Koichi Takeuchi (NII, now at Okayama University)
    • Nigel Collier (NII and JST)

Funding

PIA is supported by funds from the JSPS Science Research Fund for Young Researchers (Ref. 14701020) and the National Institute of Informatics.