
Selected papers


FABRA is a readability toolkit for French based on the aggregation of a large number of readability predictor variables. The toolkit is implemented as a service-oriented architecture, which obviates the need for installation and simplifies its integration into other projects.

ALGLM Assessing-Linguistic-Generalisation-in-Language-Models

coFR COreference resolution tool For FRench

brWaC The Brazilian Portuguese Web as Corpus is a large corpus constructed in our lab following the WaCky framework and made public for research purposes.




Courage@exist Participation of the COURAGE project team (from the University of Milano-Bicocca) in the EXIST shared task.


Publications (full list)

Wilkens, Rodrigo; Villavicencio, Aline; Muller, Daniel; Wives, Leandro; Loh, Stanley. COMUNICA - A Question Answering System for Brazilian Portuguese. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING), 2010.

Research Projects

   2021 - present CEFR-FR Reitor: This project looks at the automatic assessment of the written competence of learners of French as a foreign language (FFL) from a two-pronged perspective. First, building on a collaboration with France Éducation International, we will compile the largest corpus of learner productions for FFL. This learner corpus will enable us to build up an inventory of linguistic phenomena and to estimate their distribution over the six levels of the Common European Framework of Reference for Languages (CEFR). In a second step, we will develop an artificial intelligence algorithm capable of assigning one of the six CEFR levels to a learner's written production. It will also be able to automatically identify the linguistic phenomena included in our inventory in a learner's production and link them to a CEFR level. This will enable us to provide a detailed diagnosis of the learner's level of competence across different linguistic levels. The performance of this model and its usefulness for the training of future language assessors will then be evaluated.

  2020 - 2021 COURAGE: COURAGE at the University of Milano-Bicocca is a collaboration funded by the Volkswagen Foundation as part of the Artificial Intelligence and the Society of the Future funding initiative. Its partners include the Universitat Pompeu Fabra (Spain), the Istituto per le Tecnologie Didattiche of the National Council of Research (ITD-CNR), the Hochschule Ruhr West (Germany) and the Rhine-Ruhr Institute for System Innovation (Germany). The project brings together a multi-disciplinary consortium to develop novel approaches to some of the major challenges that social media pose to society, and to its young members in particular. It aims to develop a Virtual Social Media Companion that educates and supports teenage school students facing the threats of social media, such as discrimination and biases, hate speech, bullying, fake news and other toxic content. The University of Milano-Bicocca team drives two strands of this work: machine-learning-based user modelling and the analysis of textual data, drawing on our expertise in natural language processing (NLP).

  2019 - 2020 ALECTOR: This project addresses scientific issues including readability assessment, lexical simplification, syntactic simplification, and discourse transformations. The project targets dyslexic and poor readers: text transformations will be based on theoretical findings about the reading process and further refined by specific adaptations that leverage feedback from the targeted audience. As one of its key innovative deliverables, ALECTOR will propose a web-based application where a simplified corpus will be available to teachers and speech therapists.

    2017 - 2019 Smart and Adaptive Language Learning Applications (SMALLA): This project aims to allow second language learners to read texts that match their interests and reading skills, helping e-learning environments to keep learners engaged in learning activities. The three core research fields involved are user modeling, text profiling, and text classification. The user-modeling feature aims to measure the user's skills by tracking user interaction and e-learning tests. The text-profiling feature runs in the background, both searching for texts and building a profile of them. Finally, the text classification feature combines the results of the other two features by selecting texts that fit the learner's interests and reading skills. This would allow learners to bring their own interests into the e-learning platform and use its resources in specific reading and learning tasks.

    2014 - 2016 ExplainText Text simplification of complex expressions: The goal of this project was to investigate and develop techniques, resources and tools for automatic text simplification. The idea was to rewrite texts to make them more accessible and easier to understand for a larger audience. Our focus was on lexical simplification, where more difficult words, in particular multiword expressions, are replaced by more familiar synonyms. The project Simplification of Complex Expressions was funded by Samsung Research.

    2013 - 2016 Computational cognitive language models in the Autism Spectrum Disorder: This project aims to develop computational resources and models to investigate factors related to children's language acquisition and to the use of language in clinical conditions, such as autism spectrum disorders and aphasia. It is mainly centred on the influence of language processes in clinical and non-clinical cases, dealing with linguistic and psycholinguistic information.

    2012 - 2016 Cognitive Computational Models of Natural Languages for Assessing Language Competency: In this project, we investigated the influence of language factors in low literacy and pathologies, focusing on Alzheimer’s Disease. Although there seems to be a link between factors such as frequency and age of acquisition and the strategies employed in processing, there is still much to investigate, in particular for Brazilian Portuguese. The long-term goal of this investigation was an improved scientific understanding of human language processing and its impact on the development of educational technology, as well as on treatments and rehabilitation for various language disorders.

    2008 - 2011 COMUNICA - Databases Access by Telephone: The project consisted of developing an automatic question answering system accessible by telephone, allowing the population to access public digital data. The project was financed by the public and private sectors and was developed by a group of companies in collaboration with the Institute of Informatics of UFRGS.