Working plan

General Objective: the final goal of my work is to populate a knowledge base, built on the Microbio ontology about miRNAs and including modality information, and to study and develop mechanisms for its interactive visualization. To achieve the first goal, I plan to develop new information extraction learning methods that instantiate relations of the ontology [1], working on a corpus of scientific literature. To visualize the information, I plan to study the complexity of graph visualization and how interaction can help the user explore the data [2].

Frame: the Microbio project. The work will use the tools, corpus and ontologies already built within the project.

Inputs: the Microbio ontology, the modality ontology, an unannotated domain corpus, and the web.

Tasks:
1. Annotate gene names, miRNA names and relations in the corpus, using available tools (like ABNER) or lexico-syntactic patterns, and manually validate the results with the help of domain researchers from the Institut Pasteur.
2. Develop supervised or semi-supervised learning methods for relation instantiation. This includes developing kernels to improve relation extraction, based on tree or graph representations of parsed sentences and their annotations [3].
3. Classify relation modality. Here I will study the context in which relations are validated [4] and develop machine learning methods that use modality information manually annotated on the corpus, possibly complemented by a new corpus built from the web and automatically annotated using lexico-syntactic patterns. Based on these corpora, and using semi-supervised machine learning methods, I will classify each relation in the corpus according to the modality ontology.
4. Evaluate the results and refine the extraction and inference techniques.
5. Define and develop visualization methods to access the knowledge base.
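As an illustration of the lexico-syntactic patterns mentioned in task 1, the following is a minimal sketch only, not the project's actual annotation method. The miRNA name pattern, the crude gene pattern (a stand-in for a tagger such as ABNER), the list of trigger verbs and the function name extract_candidates are all assumptions made for this example; real patterns would be defined and validated with the domain researchers.

```python
import re

# Hypothetical pattern for miRNA mentions such as "miR-21", "hsa-miR-155" or "let-7a".
MIRNA = r"(?:[a-z]{3}-)?(?:miR|let)-\d+[a-z]?(?:-[35]p)?"
# Crude stand-in for a gene tagger: an upper-case symbol of at least three characters.
GENE = r"[A-Z][A-Z0-9]{2,}"
# Illustrative trigger verbs (an assumption, not a validated list).
TRIGGER = r"(?:targets|regulates|represses|inhibits)"

# Pattern of the form "<miRNA> [directly] <verb> <gene>".
RELATION = re.compile(
    rf"\b(?P<mirna>{MIRNA})\s+(?:directly\s+)?(?P<verb>{TRIGGER})\s+(?P<gene>{GENE})\b"
)


def extract_candidates(sentence):
    """Return (miRNA, verb, gene) candidate relations found in one sentence."""
    return [
        (m.group("mirna"), m.group("verb"), m.group("gene"))
        for m in RELATION.finditer(sentence)
    ]


if __name__ == "__main__":
    for s in [
        "We show that miR-21 targets PTEN in tumour cells.",
        "hsa-miR-155 directly represses SOCS1.",
    ]:
        print(extract_candidates(s))
```

Candidates produced this way would still need the manual validation described in task 1; the pattern is deliberately narrow and favors precision over recall.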
For the visualization task, I will use information visualization techniques suited to large amounts of data to develop interactive visualizations that allow Pasteur researchers to explore the knowledge base. For example, they could access a general view of the ontology in which the most-referenced concepts are highlighted, or a view that emphasizes the relations that, according to the inference algorithms used, are more "secure" or are mentioned more often in the source documents. For each concept or relation, the source documents could be displayed. Relations could also be filtered by aspects of modality (for example, "show all relations tagged as 'doubt'"). Any other visualization requirements defined by the Institut Pasteur researchers will also be considered.

Technology: Python/scipy for information extraction. For machine learning, Weka or SVM-Light. For information integration and representation, UIMA. For information visualization, the Processing language.

Product: the final product will be a platform based on UIMA that will allow researchers to:
a) incorporate new documents, extract their information and populate the ontology;
b) visualize the results.

Term: 2 years, ending in September 2011.

[1] Relation Instantiation for Ontology Population using the Web. V. de Boer et al. [2006]
[2] Visualizing Data. B. Fry. O'Reilly. [2008]
[3] A Composite Kernel to Extract Relations between Entities with both Flat and Structured Features. M. Zhang et al. [2006]
[4] Knowledge Claims in Scientific Literature, Uncertainty and Semantic Annotation: A Case Study in the Biological Domain. D. Battistelli, F. Amardeilh. [2009]