Since February 2012, I have been a junior professor at the Institute for NLP (IMS), University of Stuttgart, in the framework of the collaborative research centre SFB 732. Before that, I was a post-doc (maître-assistante) at the University of Geneva, working on cross-lingual transfer of semantic role labelling as part of the CLASSiC project. I earned my PhD from the University of Groningen, where I worked on automatic lexical acquisition from corpora within the Alfa-Informatica group. I was a visiting academic at the Division of Information and Communication Sciences of Macquarie University, Sydney, from January to March 2007. I worked at ISSCO/TIM-ETI (University of Geneva) from 2002 until 2003, and in industry for one year at Systran Translation Systems in 2001-2002. Before that, I did the M.Phil in Computer Speech and Language Processing at the University of Cambridge. The M.Phil has since been renamed Computer Speech, Text and Internet Technology.
I have been working on the following subjects: cross-lingual natural language processing, automatic lexical acquisition, text mining, (medical) terminology extraction, computational lexicology, question answering, semantic role labelling, probabilistic modelling, cross-lingual annotation transfer.
In the CLASSiC project (Computational Learning in Adaptive Systems for Spoken Conversation), we are focusing on semantic role labelling for French, and in particular on methods to automatically generate semantic annotations for French. Syntactic annotation is available for French, but semantic annotation is not. Since semantic annotation is available for English and there are parallel corpora for the language pair English-French, we transfer the semantic annotation from English sentences to their French translations using word alignments. Contrary to previous work (Padó and Pitel, TALN 2007; Padó and Lapata, Comp. Ling. 2009; Basili et al., CICLing 2009), we did not use an ontology constructed for the target language, because we want to minimize the amount of manual labour and aim for broad-coverage annotations. We used the PropBank annotation framework, constructed for English, to annotate French sentences, after having tested the cross-lingual validity of PropBank (Van der Plas et al., LAW 2010). Because we know that there is a high correlation between syntax and semantics (see also Merlo and Van der Plas, ACL 2009), we leveraged the information contained in the syntactic annotations in a second step: we trained a syntactic-semantic parser on the combination of the syntactic annotations and the semantic annotations resulting from transfer. The automatically generated semantic annotations for French are close to the upper bound from manual annotations (Van der Plas et al., ACL 2011).
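The transfer step described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function name `project_roles`, the data layout, and the toy sentence pair are assumptions made for illustration, and real word alignments are noisy and often one-to-many. The sketch copies each English token's PropBank-style label to the French token it is aligned with.

```python
from typing import List, Optional, Set, Tuple


def project_roles(
    src_labels: List[Optional[str]],
    alignment: Set[Tuple[int, int]],
    tgt_len: int,
) -> List[Optional[str]]:
    """Copy labels from source tokens to aligned target tokens.

    src_labels: one label per source (English) token, None if unlabelled.
    alignment:  (source index, target index) pairs from a word aligner.
    tgt_len:    number of tokens in the target (French) sentence.
    """
    tgt_labels: List[Optional[str]] = [None] * tgt_len
    for src_idx, tgt_idx in sorted(alignment):
        label = src_labels[src_idx]
        # Keep the first label that reaches a target token; unaligned
        # target tokens simply stay unlabelled.
        if label is not None and tgt_labels[tgt_idx] is None:
            tgt_labels[tgt_idx] = label
    return tgt_labels


# Toy example (hypothetical data).
english = ["Mary", "sold", "the", "book"]
french = ["Marie", "a", "vendu", "le", "livre"]
english_labels = ["A0", "sell.01", None, "A1"]
alignment = {(0, 0), (1, 2), (2, 3), (3, 4)}

french_labels = project_roles(english_labels, alignment, len(french))
for token, label in zip(french, french_labels):
    print(f"{token}\t{label}")
```

In this toy pair, the auxiliary "a" has no aligned English token and therefore receives no label. Gaps of this kind are one reason a second step that exploits the French syntactic annotations is useful.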
Watch a video of the current CLASSiC system.
Freedom and liberty share the same meaning. Paris denotes a city, and the word party triggers associations of wine and fun for many. People naturally acquire these lexico-semantic relations such as synonyms, categorised named entities, and associations by using language in their daily life.
For many natural language processing applications, such as question answering, this type of information is essential, e.g. to recognise that a particular meaning can be inferred from different text variants or to compensate for the lack of general world knowledge.
This thesis proposes three methods for using large text corpora to acquire lexico-semantic information automatically: a syntax-based method, a multilingual word-alignment-based method and a proximity-based method. The three methods complement each other in the type of data needed, in the way they deal with sparse data and, most importantly, in the types of lexico-semantic information they provide. This information is then applied to the Groningen question answering system Joost. Among the different types of lexico-semantic information acquired, categorised named entities (e.g. Paris denotes a city) improved the system the most; this information was obtained with the syntax-based method.
Try our demos of semantically related words (in Dutch). The complete text of my thesis can be found here.