My research activities center on the processing of medical information. During my PhD, I studied the compilation of multilingual medical resources to build terminologies and lexical resources. This involved linguistic processing as well as data mining techniques such as machine learning. At Nanyang Technological University, my research focused on opinion mining in the medical domain, which involved resource compilation, corpus analysis and linguistic processing. I am now working on the Khresmoi project, which aims to create a multilingual and multimedia platform for accessing biomedical information. DCU's role in this project mainly involves leading the evaluation of the system (both empirical and user-centered), supporting multilingual information access, and designing collaborative functionalities and results summarisation.
2015: ECIR program committee
2014: CIKM program committee
2014: COLING publication co-chair
2014: CLEF eHealth co-chair
2013: reviewing papers for the Journal of Medical Internet Research (JMIR)
2013: Member of the organizing committee of the CLEF eHealth evaluation lab
2013: SIGIR publication co-chair
1. Postdoc at Dublin City University
The Khresmoi project aims at building a system for multilingual, multimodal search and access to biomedical information sources. Our research group is involved in three work packages (biomedical text mining and search; user interface and search system; multilingual resources and information delivery) and leads the work package in charge of evaluating the project and the system developed. My work for all of these work packages consists of conducting research and development, managing progress, organising meetings and teleconferences, and writing deliverables and research papers. During my first year in the project, we published two papers in refereed workshops: one on the development of collaborative functionalities for medical information systems (LREC workshop) and one on the development of a large-scale user-centred evaluation within Khresmoi (CLEF eHealth workshop).
2. Postdoc at Nanyang Technological University
Social media are commonly used to express opinions on subjects of interest. Our objective is to develop an effective method for sentiment analysis and summarization of social media content, especially in the health and medical fields. As a target domain, we focus on drugs. We aim to build a web-based system that provides a summarized view of public opinion.
A sentence-based system has been built that semantically annotates sentences using the semantic types of a medical thesaurus (e.g. Chemical & drugs, Symptom) and then predicts sentiment toward various aspects of a drug (e.g. side effects, cost) using machine learning and linguistic approaches. This project has led to two publications in international refereed conferences and one journal paper.
Title: Characterization and compilation of specialized comparable corpora
Jury
Supervisor: Béatrice Daille, University of Nantes
Advisor: Emmanuel Morin, University of Nantes
President: Alexandre Dikovsky, University of Nantes
Reviewers: Monique Slodzian, National Institute of Oriental Languages and Civilizations, and Pierre-François Marteau, University of South Brittany
Other member: Kyo Kageura, University of Tokyo
Keywords: Comparable corpora, specialized languages, stylistic analysis, multilingual typology, type of discourse, machine learning
Comparable corpora are sets of texts written in different languages that are not translations of each other but that share common characteristics. Their main advantage is that they are fully representative of the linguistic and cultural specificities of their respective languages. The Web could theoretically be considered a source of comparable corpora. However, the quality of the corpora and of the resources extracted from them depends on how the corpora are defined beforehand and on how carefully they are compiled (i.e. on the definition of the common features of the comparable corpora). In this thesis, we focus on the compilation of specialized comparable corpora in French and Japanese whose documents are extracted from the Web. We propose a definition of these corpora and a set of common features: a specialized domain, a topic and a type of discourse (science or popular science). Our goal is to create a tool to assist comparable corpora compilation. First, we present the automatic recognition of the common features. Topics can be easily identified with the keywords used in Web searches. In contrast, detecting the type of discourse requires a broad stylistic analysis. This analysis is performed over a learning corpus and leads to the creation of a bilingual typology based on three levels of analysis: structural, modal and lexical. Second, we use this typology to learn a classification model with SVMlight and C4.5. This classification model is tested over an evaluation corpus. Our test results indicate that more than 70% of the documents are correctly classified. Finally, the classifier is integrated into a comparable corpora compilation assistant tool built on the UIMA framework.
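As a rough illustration of the classification step described in this abstract, the sketch below computes one toy feature per level of analysis (structural, modal, lexical) and assigns a type of discourse to a document. It is not the thesis implementation: the three features, the English cue words and the tiny training texts are all hypothetical, and a plain nearest-centroid classifier stands in for SVMlight and C4.5 so that the example runs with the Python standard library only.

```python
import re
from math import dist

# Hypothetical modal cue: direct address to the reader (not from the thesis).
SECOND_PERSON = {"you", "your"}


def features(text):
    """Return one toy feature per analysis level: (structural, modal, lexical)."""
    words = re.findall(r"[a-z]+", text.lower())
    n = max(len(words), 1)
    # Structural: does the document contain a line that reads "References"?
    structural = 1.0 if re.search(r"^references\s*$", text, re.I | re.M) else 0.0
    # Modal: share of second-person words.
    modal = sum(w in SECOND_PERSON for w in words) / n
    # Lexical: mean word length, divided by 10 to keep it near the other scales.
    lexical = sum(len(w) for w in words) / n / 10
    return (structural, modal, lexical)


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return tuple(sum(col) / len(col) for col in zip(*vectors))


def train(labelled_docs):
    """Build one centroid per label from (text, label) pairs."""
    by_label = {}
    for text, label in labelled_docs:
        by_label.setdefault(label, []).append(features(text))
    return {label: centroid(vs) for label, vs in by_label.items()}


def classify(model, text):
    """Return the label whose centroid is closest to the document's features."""
    vec = features(text)
    return min(model, key=lambda label: dist(model[label], vec))


# Invented training documents, for illustration only.
TRAINING = [
    ("Randomized cohort analysis indicates significant glycemic improvement.\n"
     "References\n", "science"),
    ("Do you worry about your sugar? You can eat well and feel good.",
     "popular"),
]

model = train(TRAINING)
print(classify(model, "Should you change your diet? Ask your doctor."))
```

A real typology would use many more features per level and language-specific cues for French and Japanese; the point here is only the shape of the pipeline, from stylistic features to a learned type-of-discourse classifier.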
The Khresmoi project aims to develop a multilingual multimodal search and access system for biomedical information and documents. Khresmoi is adopting a user-centred approach to designing medical information search tools, for which three groups of end users are defined: the general public, physicians and radiologists. It is a four-year EU 7th Framework Programme project that started in September 2010. Twelve partners from 9 EU countries are involved.
ANR project METRICC:
The aim of this project is to explore the use of comparable corpora in three contexts: multilingual information extraction, translation memory creation and multilingual categorization. This project involves 6 partners: the research institutes LIG (Grenoble, France), LINA (Nantes, France) and Valoria (Vannes, France), and the French private research companies Lingua et Machina, Sinequa and Syllabs. My role in this project is to supervise a master's student in linguistics who is working on improving contextual information detection for multilingual alignment.
TCAN-CNRS 2004-2006. This project involved the following research institutes: LINA, INALCO (National Institute of Oriental Languages and Civilizations, Paris), Xerox Research (Grenoble) and NII (National Institute of Informatics, Tokyo). The aim of this project was to use comparable corpora in French, Japanese and Russian to gather multilingual information. My master's thesis focused on the compilation of the French part of the corpus and the extraction of information from it.