Welcome to the public website of the ASFALDA project (ANR-12-CORD-023)
The ASFALDA project started in October 2012 and ended in June 2016. It was funded by the Agence Nationale de la Recherche (ANR) and coordinated by Marie Candito (Alpage team, Univ Paris Diderot / INRIA). The other partners are Ant'inno, CEA-List, LIF, LLF and MELODI.
The project addresses a major challenge raised by the widespread availability of electronic content, namely the resulting need for sophisticated tools:
that access content in various ways: efficient information retrieval, document summarization, document classification, machine translation, information extraction
that perform inference over annotated content
To achieve these objectives, we rely on an existing standard for the semantic annotation of predicates and roles (FrameNet), and on an earlier linguistic annotation effort for French (the French Treebank).
The original FrameNet project provides a structured set of prototypical situations, called “frames”, along with a semantic characterization of the participants in these situations (called “roles”). We propose to take advantage of this semantic database, which has proved largely portable across languages, to build a French FrameNet, meaning both a lexicon listing which French lexemes can express which frames, and an annotated corpus in which occurrences of frames and the roles played by participants are made explicit. Adding semantic annotations to the French Treebank, which already contains morphological and syntactic annotations, will increase its usefulness both for linguistic studies and for machine-learning-based Natural Language Processing applications for French, such as semantic annotation of content, text mining or information extraction.
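To make the two components concrete, here is a minimal Python sketch of a lexicon entry and one annotated occurrence. It is an illustration only: the class and function names (RoleSpan, FrameOccurrence, frames_for) are hypothetical, the example sentence is invented, and the frame and role labels are borrowed from the English FrameNet rather than from the actual French FrameNet data or file format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class RoleSpan:
    """A role filler, given as a token span [start, end) in the sentence."""
    role: str
    start: int
    end: int


@dataclass
class FrameOccurrence:
    """One annotated occurrence: a trigger word, its frame, and its roles."""
    tokens: list
    trigger_index: int
    lemma: str
    frame: str
    roles: list = field(default_factory=list)

    def role_text(self, role: str) -> Optional[str]:
        """Return the text of the first span labeled with `role`, if any."""
        for span in self.roles:
            if span.role == role:
                return " ".join(self.tokens[span.start:span.end])
        return None


# Hypothetical lexicon fragment: which lexemes can express which frames.
LEXICON = {
    "acheter": {"Commerce_buy"},
    "vendre": {"Commerce_sell"},
}


def frames_for(lemma: str) -> set:
    """Look up the frames a lexeme can express (empty set if unknown)."""
    return LEXICON.get(lemma, set())


# Invented example sentence: "Marie achète un livre" (Marie buys a book).
occurrence = FrameOccurrence(
    tokens=["Marie", "achète", "un", "livre"],
    trigger_index=1,
    lemma="acheter",
    frame="Commerce_buy",
    roles=[RoleSpan("Buyer", 0, 1), RoleSpan("Goods", 2, 4)],
)

print(frames_for("acheter"))
print(occurrence.role_text("Goods"))
```

Storing roles as token spans keeps the semantic layer separate from the morphological and syntactic layers it is added on top of.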
To cope with the intrinsic coverage difficulty of such a project, we adopt a hybrid strategy that yields both exhaustive annotation for a selected set of concepts (commercial transaction, communication, causality, sentiment and emotion, time), and exhaustive annotation for some highly frequent verbs.
The project is structured as follows:
Task 1 concerns the delimitation of the targeted FrameNet substructure and the verification of its coherence, in order to make the resulting structure more easily usable for inference and for automatic enrichment (while remaining compatible with the original model);
Task 2 concerns all the lexical aspects: which lexemes can express the selected frames, how they map to external resources, and how their semantic arguments can be expressed syntactically, information that can be used for automatic pre-annotation of the corpus;
Task 3 is devoted to the manual annotation of corpus occurrences (we target 20,000 annotated occurrences);
In Task 4 we will design a semantic analyzer that automatically makes the semantic annotation (frames and roles) explicit on new sentences, using machine learning on the annotated corpus.
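As a rough illustration of what learning from the annotated corpus can mean in Task 4, the sketch below trains a most-frequent-frame baseline: for each lemma, it predicts the frame seen most often in the training occurrences. This is a deliberately simple stand-in and not the analyzer designed in the project. The function names and the training pairs are invented for the example, and role labeling is left out entirely.

```python
from collections import Counter, defaultdict


def train_most_frequent_frame(annotated):
    """Build a lemma -> frame table from (lemma, frame) training pairs."""
    counts = defaultdict(Counter)
    for lemma, frame in annotated:
        counts[lemma][frame] += 1
    # Keep only the most frequent frame observed for each lemma.
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


def predict_frame(model, lemma):
    """Return the predicted frame, or None for a lemma never seen in training."""
    return model.get(lemma)


# Invented training pairs; frame names follow the English FrameNet.
training = [
    ("dire", "Statement"),
    ("dire", "Statement"),
    ("dire", "Telling"),
    ("acheter", "Commerce_buy"),
]

model = train_most_frequent_frame(training)
print(predict_frame(model, "dire"))
print(predict_frame(model, "vendre"))
```

A baseline like this cannot handle lemmas absent from the annotated data, which is one reason generalizing beyond the available annotations matters.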
The key scientific aspects of the project are:
an emphasis on the diversity of ways to express the same frame, including expressions (such as discourse connectors) that cross sentence boundaries,
an emphasis on semi-supervised techniques for semantic analysis, to generalize over the available annotated data.
The project is ambitious: it could not be achieved by any one partner alone, and it requires intensive collaboration. The partners involved provide a strong synergy, with expertise in linguistic annotation (LLF, Alpage, IRIT), discourse analysis (IRIT and Alpage), syntactic parsing and machine learning techniques (Alpage, LIF, CEA LIST) and NLP-enhanced search engines (CEA LIST and Ant’inno).