The Portable Information Access project (PIA) ran from 2000 to 2004.
About
The Portable Information Access (PIA) project aims to develop a domain-adaptable information extraction (IE) system for annotating semantic content in texts. Other Web-based technologies such as information retrieval (IR) are characterized by strong portability, but no comparably portable system yet exists for IE.
The Semantic Web
The Semantic Web as an entity is likely to have considerable social and economic impact. Tim Berners-Lee, the inventor of the World Wide Web (Web), has termed it the second generation Web (Berners-Lee, 1999) and many of those in the Web community expect it to have at least as great an impact on the lives of ordinary users as the first generation Web based on HTML that is familiar to us today. Basically the Semantic Web is being designed so that computer programs can 'understand' the meaning of information in Web resources such as documents and carry out sophisticated tasks for users. The Semantic Web will be an extension of the Web that we all use today.
The 'smart' applications that we hope to see emerge on the Semantic Web include smart browsers, question answering systems, automatic data formatting for different devices in wireless mobile ubiquitous data networks, and support for electronic shopping and appointments through agent software. All of these services incorporate intelligent problem solving services based on knowledge provided within the Semantic Web framework. The Semantic Web is likely to be based on ontologies and metadata. Formally, an ontology may be considered to be "a specification of a conceptualization" (Gruber, 1993) and is used mainly for knowledge sharing and re-use. In practical terms, ontologies will describe the data that users wish to exchange.
In PIA we are concerned with domain ontologies that are created by domain experts and can be used to define a set of concepts and their relations for a specific group of users (domain-users) who share a common conceptualization for a given domain. The ontology should also define a set of axioms that can be applied to the concepts. The ontology needs to be implemented in a practical way, i.e. as an engineering artefact, for access by other software components in PIA. The RDF framework, its proposed extensions such as DAML+OIL (Hendler, 2000) (VanHarmelen, 2000), and the software tools currently being built to create annotations that conform to them provide the support framework within which Semantic Web applications operate.
Role of PIA
We consider that the high cost of semantic annotation and its required expertise are potential bottlenecks in the spread of the Semantic Web. PIA is therefore concerned with machine learning for text-to-knowledge conversion so that computer programs can learn how to annotate new Web-based texts based on a relatively small number of examples of annotated texts in the domain.
PIA has its foundations built on four resources which are currently being developed:
The working realization of these resources will be a system called "Ontology Forge."
Each of these resources is described below in relation to the three main knowledge sources: texts, annotations and ontologies.
Open Ontology Forge (OOF)
OOF has been designed for ontology definition and capturing instances of concepts in texts.
The tool aims to aid annotators who are experts in their domain in marking up so-called named entity expressions such as technical terms that we want to distinguish from the rest of the text. The tool also provides for annotation of coreference expressions. We provide output of instances in both RDF and in-line XML-style formats for training named entity recognizers.
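As an illustration of how in-line XML-style output could feed a named entity recognizer, the following minimal Python sketch converts one annotated sentence into token and label pairs using the common B/I/O tagging scheme. The element name `ne`, the attribute `type`, the concept names and the example sentence are assumptions made for this sketch; the actual OOF output format is not specified here.

```python
import xml.etree.ElementTree as ET


def inline_xml_to_bio(sentence_xml):
    """Convert one in-line annotated sentence into (token, BIO label) pairs.

    Assumes a hypothetical format in which each named entity is wrapped in a
    child element carrying a 'type' attribute; this is not the real OOF schema.
    """
    root = ET.fromstring(sentence_xml)
    pairs = []

    def add_tokens(text, concept):
        # Whitespace tokenization keeps the sketch simple.
        for i, token in enumerate((text or "").split()):
            if concept is None:
                label = "O"  # outside any annotated entity
            else:
                label = ("B-" if i == 0 else "I-") + concept
            pairs.append((token, label))

    add_tokens(root.text, None)  # text before the first entity
    for child in root:
        add_tokens("".join(child.itertext()), child.get("type"))
        add_tokens(child.tail, None)  # text following this entity
    return pairs


example = (
    '<s>The <ne type="PROTEIN">NF-kappa B</ne> protein binds '
    '<ne type="DNA">IL-2 promoter</ne> .</s>'
)
pairs = inline_xml_to_bio(example)
for token, label in pairs:
    print(token, label)
```

Each token inside an annotated span receives a label derived from its concept, and all other tokens are labelled `O`, which is the form most sequence classifiers expect as training data.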
PIA-Core
We believe that machine learning from examples is worth exploring as a way to reliably replicate the capabilities of experts, and this is the goal of PIA-Core. Semantic annotation of texts necessarily involves automatically identifying and classifying technical terminology and proper nouns, finding functional values and finding instances of axiomatic relations. In PIA-Core we are approaching this using supervised machine learning from annotated texts combined with a user-provided ontology (domain model). We are currently investigating a range of classifier models such as Support Vector Machines (SVMs), Maximum Entropy and Decision Trees (C5) that combine the knowledge available in the ontology with linguistically motivated features available from robust natural language processing tools such as a part-of-speech tagger and a shallow parser.
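To make the idea of combining ontology knowledge with linguistic features more concrete, the sketch below builds a feature dictionary for a single token. It is only an illustration: the function name `token_features`, the feature names, the part-of-speech tags and the tiny term-to-concept lookup are all hypothetical, and the actual PIA-Core feature set is not described here. A dictionary of this kind could be passed to any of the classifier families mentioned above after vectorization.

```python
def token_features(tokens, pos_tags, i, ontology_terms):
    """Build a feature dictionary for the token at position i.

    tokens         -- list of word strings for one sentence
    pos_tags       -- part-of-speech tags aligned with tokens
    ontology_terms -- hypothetical mapping from lower-cased term to concept name
    """
    token = tokens[i]
    return {
        # Surface (orthographic) features
        "word": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        # Linguistic features from a part-of-speech tagger
        "pos": pos_tags[i],
        "prev_pos": pos_tags[i - 1] if i > 0 else "BOS",
        "next_pos": pos_tags[i + 1] if i + 1 < len(tokens) else "EOS",
        # Knowledge from the domain ontology (assumed simple term lookup)
        "ontology_class": ontology_terms.get(token.lower(), "NONE"),
    }


tokens = ["IL-2", "activates", "T", "cells"]
pos_tags = ["NN", "VBZ", "NN", "NNS"]
ontology_terms = {"il-2": "PROTEIN"}  # illustrative entry only

features = token_features(tokens, pos_tags, 0, ontology_terms)
print(features)
```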
PIA Annotation Management (PAM) System
In order to facilitate collaborative development of ontologies as shared domain conceptualizations between experts and to encourage the sharing of annotations, we are developing the PIA Annotation Management System. The basic user model allows for three types of user privilege depending on the level of the user's expertise. The domain manager takes overall responsibility for project management, registering members and version control, including releasing a public version of the ontology to the Web; the domain expert takes responsibility for forming the ontology itself; and the domain user takes public ontologies and annotates documents according to the concepts, properties and relations given there. The public ontologies are available on the Ontology Forge server for users and software agents to view and refer to.
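The three-level privilege model described above could be sketched as a simple role-to-permission table. The action names below are hypothetical labels chosen for this illustration, and the sketch assumes that privileges are not cumulative across roles, which the description does not state either way; it is not the PAM implementation.

```python
from enum import Enum


class Role(Enum):
    DOMAIN_MANAGER = "domain manager"
    DOMAIN_EXPERT = "domain expert"
    DOMAIN_USER = "domain user"


# Hypothetical action names derived from the responsibilities listed above.
PERMISSIONS = {
    Role.DOMAIN_MANAGER: {
        "register_member",
        "manage_versions",
        "release_public_ontology",
    },
    Role.DOMAIN_EXPERT: {"edit_ontology"},
    Role.DOMAIN_USER: {"annotate_document"},
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS[role]


print(is_allowed(Role.DOMAIN_EXPERT, "edit_ontology"))
print(is_allowed(Role.DOMAIN_USER, "release_public_ontology"))
```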
Publications
Members
Funding
PIA is supported by funds from the JSPS Science Research Fund for Young Researchers (Ref. 14701020) and the National Institute of Informatics.