Natural Language Processing for Portuguese (NLP2)

The NLP2 project aims to investigate and develop resources, tools and applications that bring Portuguese NLP to the world state of the art, moving Portuguese out of the low-resource language scenario. It is an effort of the Center for Artificial Intelligence (C4AI) of the University of São Paulo, sponsored by IBM and FAPESP (grant #2019/07665-4). The center is part of the FAPESP Engineering Research Centers Program and is committed to state-of-the-art research in Artificial Intelligence, covering both foundational issues and applied research.

The initiative currently covers both the written and spoken modalities of Portuguese, focusing on three main tasks: (i) from a syntactic perspective, following the Universal Dependencies framework, we aim to produce a large multi-genre corpus of annotated texts and to build robust tagging and parsing models; (ii) from a language modeling perspective, we aim to build a pipeline for constructing context-based neural models, with applications to natural language inference; (iii) for spoken language, we aim to build multitask corpora for speech recognition, multi-speaker synthesis, speaker identification, voice cloning and speech-as-a-biomarker classification, producing a large corpus of recorded and transcribed spoken Brazilian Portuguese. These research fronts are organized into three main subprojects, namely POeTiSA, TaRSila and Carolina. Key information about each of them is gathered on this portal, and more details are available on their websites.