I have a PhD in Computer Science and am Associate Professor (maître de conférences) in data science at the Data Science department of IMT Atlantique, Brest, France.
I research computational systems that deal with the diversity of languages and their speakers, with two goals in mind: to facilitate interlingual communication, but also to preserve what is unique and different in each language. As an example of the former, my research on the interoperability of multilingual healthcare data aims to let patients receive care regardless of linguistic and administrative barriers. My research on linguistic diversity and modelling bias in language resources, on the other hand, tries to help capture “the local”: words and ideas that tend to get lost in translation, yet that matter to us all the more. Having a knack for implementing things, I enjoy building actual systems and participating in projects where my research is applied.
My other online profiles
I am currently working on...
I am glad to have become Associate Editor of Springer Nature Computer Science.
Our paper on language modelling bias in AI was accepted to the FAccT 2024 conference!
This recently accepted paper honours an amazing effort by my colleague Abed Alhakim Freihat on a new, high-quality wordnet resource for Arabic.
I was involved in a paper on pluriversal design for language technology, accepted by the CoDesign journal.
Our paper on ethics / epistemic injustice in the field of AI language modelling was accepted by the journal Ethics and Information Technology.
The article of my PhD student, Hadi Khalilia (with contributions from master student Shandy Darma), on lexical diversity among Arabic dialects and Indonesian languages, was accepted by Frontiers in Psychology.
I am glad to share that, together with my collaborator Paula Helm, I was interviewed by Johana Bhuiyan from The Guardian on the topic of linguistic bias in AI language technology and in machine translation in particular. Our research on linguistic bias studies how language technology favours, by design, certain languages to the detriment of others.
https://www.theguardian.com/us-news/2023/sep/07/asylum-seekers-ai-translation-apps
Our review article on the representational power of multilingual lexical databases, titled Representing Interlingual Meaning in Lexical Databases, was published by the prestigious journal Artificial Intelligence Review.
The UKC, our large multilingual lexical database, is being used in an increasing number of projects. The most recent is a study on metonymy as a universal cognitive phenomenon, with our paper accepted for CogSci 2022.
The Universal Knowledge Core, a huge multilingual lexical database with >1,100 languages and >2 million words, has been released and is browseable online. Our goal is to provide a diversity-aware linguistic database that highlights both what is unique and what is common among languages. Do not hesitate to browse the lexicons, check out the visualisations, try out the word translations, or check out the related projects!
Check out the new website dedicated to the Language of Data, with free downloads of annotated corpora and machine learning models: http://www.languageofdata.science.
I am coordinating efforts on building the Universal Knowledge Core, a large-scale multilingual lexico-semantic resource;
I am managing collaborative research and development on SCROLL, a multilingual NLP platform dedicated to the semantic analysis of text in structured data.
Other ongoing and past research
Roughly in reverse chronological order:
The InteropEHRate project on citizen-driven health data interoperability has been successfully closed and reviewed! The project was a serious validation opportunity for our research in multilingual data integration, and a huge learning experience for me about AI in healthcare. I also appreciated the professionalism and the dedication of our colleagues and partners.
Our paper "Language Diversity: Visible to Humans, Exploitable by Machines" has been accepted to the ACL 2022 demo track.
I have participated in the organisation and review of the SIGMORPHON 2022 shared task on morphological segmentation.
We have three papers published at LREC 2022!
Our conference paper, presenting our new database and visualisation of the similarity of contemporary lexicons, has been accepted for TSD 2021.
Our journal paper on the huge CogNet database, titled A Large and Evolving Cognate Database, was accepted by the Language Resources and Evaluation Journal.
our paper, introducing a new research area on the Language of Data, was accepted for COLING 2020;
a paper on domain-grammar-based matching I co-authored with Francisco J. Quesada Real, Fiona McNeill, and Alan Bundy, was just accepted to the 15th Ontology Matching Workshop;
our paper with Mattia Fumagalli et al. was just accepted for the FOIS 2020/2021 conference;
submitted a paper to the LRE journal on CogNet, the largest cognate database in the world;
our paper, A Major Wordnet for a Minority Language: Scottish Gaelic, has been accepted for LREC 2020;
finished work on a conference paper on cross-lingual medical data integration, accepted for PAIS@ECAI 2020;
I led a small research group that participated in a UK-funded SPRINT project on Scottish healthcare data integration;
I was Programme Chair at the successful CONTEXT 2019 conference;
I have released the open-source Diversicon Framework, co-written with David Leoni, that helps with the pluggable integration of lexico-semantic domain resources for natural language understanding tasks;
contributed to a series of papers on Arabic NLP, the bulk of the work having been done by my colleague Hakim;
managed a project on medical knowledge and data integration for the National Health Services of Scotland;
workpackage leader in the Healthcare Data Safe Havens EIT project, for cross-jurisdictional integration of medical experiments;
workpackage leader in the ESSENCE Marie Curie training network, partaking in the organisation of numerous academic events (summer schools, workshops, competitions);
researching cross-lingual and domain-based ontology matching, especially applied to classifications;
quantitative research on language diversity and its application to the generation of linguistic resources;
WP leader in the QROWD project (H2020 innovation action) on crowdsourced collection and curation of transportation data;
led the development of a UI widget for semantic annotation of multilingual text (by word senses and named entities);
project leader on Open Data Trentino, a semantic open data integration project;
as part of my PhD thesis, I designed an extensible knowledge representation framework for symbols of the world’s writing systems; a concrete outcome was an OWL representation of all Unicode characters and their properties, thus providing extensibility to Unicode while respecting its status as a centrally controlled international standard.