TaRSila
Automatic Speech Recognition and Speech Synthesis at the Center for Artificial Intelligence
Tarefa de Anotação para o Reconhecimento e Síntese de fala da Língua Portuguesa (Annotation Task for Portuguese Speech Recognition and Synthesis)
The TaRSila project aims to grow speech datasets for the Brazilian Portuguese language, seeking state-of-the-art results for the following tasks:
(a) automatic speech recognition (ASR) that automatically transcribes speech;
(b) multi-speaker synthesis (TTS) that generates several voices from different speakers;
(c) speaker identification/verification, which selects a speaker from a set of predefined members, either in a closed-set scenario (speakers seen during the training of the models) or in an open-set scenario (verification of speakers not seen during the training of the models); and
(d) voice cloning, which uses a voice dataset of only a few minutes or seconds to train a voice model with synthesis methods that can read any text in the target voice.
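The difference between the closed-set and open-set scenarios in task (c) can be illustrated with a minimal Python sketch. It is not the project's method: it assumes each utterance has already been mapped to a fixed-size speaker embedding, and the function names, the toy 3-dimensional vectors, and the 0.8 threshold are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def identify_closed_set(embedding, enrolled):
    """Closed-set: always answer with the most similar known speaker."""
    return max(enrolled, key=lambda name: cosine(embedding, enrolled[name]))


def verify_open_set(embedding, reference, threshold=0.8):
    """Open-set: accept or reject a claimed identity by thresholding similarity.

    The reference embedding may come from a speaker never seen in training.
    The threshold value is an assumption chosen for illustration only.
    """
    return cosine(embedding, reference) >= threshold


# Toy embeddings (hypothetical values; real embeddings have far more dimensions).
enrolled = {"ana": [1.0, 0.0, 0.0], "bruno": [0.0, 1.0, 0.0]}
probe = [0.9, 0.1, 0.0]
stranger = [0.0, 0.0, 1.0]

best_match = identify_closed_set(probe, enrolled)
probe_accepted = verify_open_set(probe, enrolled["ana"])
stranger_accepted = verify_open_set(stranger, enrolled["ana"])
```

In this sketch the closed-set function always returns one of the enrolled names, even for an unrelated voice, whereas the open-set check can reject a speaker whose similarity falls below the threshold.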
In TaRSila, we manually validated speech datasets from academic projects such as: (i) NURC-Recife (OLIVEIRA JR, 2016); (ii) SP 2010 (MENDES, 2013); (iii) ALIP (GONÇALVES, 2019); and (iv) C-ORAL Brasil (RASO & MELLO, 2012).
A collection of 365 hours of life stories from the Museu da Pessoa (MuPe) was processed to become part of our large corpus CORAA (CORpus de Áudios Anotados), and the NURC-SP Audio Corpus was also processed for training ASR models. See details of all the datasets created on CORAA Versions.
Regarding the tools, we aim to investigate recent deep learning methods for training robust ASR and TTS models for Portuguese.
The project also foresees applications in semantic search from speech transcriptions, as well as sentiment analysis and automatic organization of speech datasets into topics.
This project is part of the Natural Language Processing initiative (NLP2) of the Center for Artificial Intelligence (C4AI) of the University of São Paulo, sponsored by IBM and FAPESP (grant #2019/07665-4). The center is part of the FAPESP Engineering Research Centers Program and is committed to state-of-the-art research in Artificial Intelligence, exploring both foundational issues and applied research. See also the NLP2 web portal.
Related Publications
Published & Accepted for Publication:
E. CASANOVA; A. CANDIDO JR; C. SHULBY; F. S. OLIVEIRA; J. P. TEIXEIRA; M. A. PONTI; S. M. ALUÍSIO. TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese. Language Resources and Evaluation, 2022. (https://github.com/Edresson/TTS-Portuguese-Corpus)
E. CASANOVA; C. SHULBY; E. GÖLGE; N. M. MÜLLER; F. S. OLIVEIRA; A. CANDIDO JR; A. S. SOARES; S. M. ALUÍSIO; M. A. PONTI. SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model. In: INTERSPEECH, 2021, Brno. Interspeech 2021, ISCA, 2021. p. 3645-3649.
E. CASANOVA; A. CANDIDO JR; C. SHULBY; F. S. OLIVEIRA; L. R. S. GRIS; H. P. SILVA; S. M. ALUÍSIO; M. A. PONTI. Speech2Phone: A new and efficient method for training speaker recognition models. 2021. In: 10th Brazilian Conference on Intelligent Systems (BRACIS), 2021.
Lucas Rafael Stefanel Gris, Edresson Casanova, Frederico Santos de Oliveira, Anderson da Silva Soares, Arnaldo Candido Junior. Desenvolvimento de um modelo de reconhecimento de voz para o Português Brasileiro com Poucos Dados Utilizando o Wav2vec 2.0. In: Anais do XV Brazilian e-Science Workshop. SBC, 2021. p. 129-136.
GONZAGA, V. M.; MURRUGARRA-LLERENA, N. ; MARCACINI, R. M. Multimodal intent classification with incomplete modalities using text embedding propagation. In: Brazilian Symposium on Multimedia and Web (WebMedia), 2021 (Best Short Paper). https://doi.org/10.1145/3470482.3479636.
Gôlo, Marcos PS, Rafael G. Rossi, and Ricardo M. Marcacini. "Triple-VAE: A Triple Variational Autoencoder to Represent Events in One-Class Event Detection." In Anais do XVIII Encontro Nacional de Inteligência Artificial e Computacional, pp. 643-654. SBC, 2021. (Best Paper ENIAC 2021 - Main track) https://doi.org/10.5753/eniac.2021.18291
Souza, Mariana C. de; Bruno M. Nogueira; Rafael G. Rossi; Ricardo M. Marcacini, and Solange O. Rezende. A Heterogeneous Network-Based Positive and Unlabeled Learning Approach to Detect Fake News. In Brazilian Conference on Intelligent Systems, pp. 3-18. Springer, Cham, 2021. https://dx.doi.org/10.1007/978-3-030-91699-2_1
Souza, Mariana C, Bruno Magalhães Nogueira, Rafael Geraldeli Rossi, Ricardo Marcondes Marcacini, Brucce Neves Dos Santos, and Solange Oliveira Rezende. A network-based positive and unlabeled learning approach for fake news detection. Machine Learning (2021): 1-44. https://dx.doi.org/10.1007/s10994-021-06111-6
Mattos, Joao Pedro Rodrigues; and Ricardo M. Marcacini. Semi-Supervised Graph Attention Networks for Event Representation Learning. In 2021 IEEE International Conference on Data Mining (ICDM), pp. 1234-1239. IEEE, 2021. https://doi.org/10.1109/ICDM51629.2021.00150
Carmo, Paulo, and Ricardo Marcacini. Embedding propagation over heterogeneous event networks for link prediction. In 2021 IEEE International Conference on Big Data (Big Data), pp. 4812-4821. IEEE, 2021. https://doi.org/10.1109/BigData52589.2021.9671645
Lucas Rafael Stefanel Gris, Edresson Casanova, Frederico Oliveira, Anderson da Silva Soares and Arnaldo Candido Junior. Brazilian Portuguese Speech Recognition Using Wav2vec 2.0. In Computational Processing of the Portuguese Language - 15th International Conference, PROPOR 2022, Fortaleza, Brazil, March 21-23, 2022, Proceedings. Lecture Notes in Computer Science 13208, Springer 2022, ISBN 978-3-030-98304-8. https://dblp.org/rec/conf/propor/GrisCOSJ22
Edresson Casanova, Julian Weber, Christopher Shulby, Arnaldo Candido Junior, Eren Gölge, Moacir Antonelli Ponti. YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone. Pre-print version. Accepted for publication in the 39th International Conference on Machine Learning (ICML 2022).
Marcelo Matheus Gauy & Marcelo Finger. "Pretrained audio neural networks for Speech emotion recognition in Portuguese". In the Proceedings of the First Workshop on Automatic Speech Recognition for Spontaneous and Prepared Speech & Speech Emotion Recognition in Portuguese, co-located with PROPOR 2022. March 21st, 2022 (Online). Vol. 1, pp. 15-24, 2022. Link: https://sites.google.com/view/ser2022/
Caroline Alves, Bruno Carlotto, Bruno Dias, Anátale Garcia, Bruno Gianesi, Renan Izaias, Maria Luiza Morais, Paula Oliveira, Vinícius G. Santos, Rafael Sicoli, Flaviane R. Fernandes Svartman, Sandra Aluísio & Sidney Leal. "Transfer Learning and Data Augmentation Techniques applied to Speech Emotion Recognition in SE&R 2022". In the Proceedings of the First Workshop on Automatic Speech Recognition for Spontaneous and Prepared Speech & Speech Emotion Recognition in Portuguese, co-located with PROPOR 2022. March 21st, 2022 (Online). Vol. 1, pp. 25-36, 2022. Link: https://sites.google.com/view/ser2022/
Alexander Scaranti, Douglas Silva, Fernando Meloni & Alessandra Alaniz. "Speech Emotion Recognition in Portuguese for SofiaFala: SER SofiaFala". In the Proceedings of the First Workshop on Automatic Speech Recognition for Spontaneous and Prepared Speech & Speech Emotion Recognition in Portuguese, co-located with PROPOR 2022. March 21st, 2022 (Online). Vol. 1, pp. 37-41, 2022. Link: https://sites.google.com/view/ser2022/
Arnaldo Candido Junior, Edresson Casanova & Ricardo Marcacini. "Overview of the Automatic Speech Recognition for Spontaneous and Prepared Speech & Speech Emotion Recognition in Portuguese (S&ER) Shared-tasks at PROPOR 2022". In the Proceedings of the First Workshop on Automatic Speech Recognition for Spontaneous and Prepared Speech & Speech Emotion Recognition in Portuguese, co-located with PROPOR 2022. March 21st, 2022 (Online). Vol. 1, pp. 1-8, 2022. Link: https://sites.google.com/view/ser2022/
Santos, V.G., Alves, C.A., Carlotto, B.B., Papa Dias, B.A., Stefanel Gris, L.R., Lima Izaias, R.d., Azevedo de Morais, M.L., Marin de Oliveira, P., Sicoli, R., Svartman, F.R.F., Leite, M.Q., Aluísio, S.M. (2022) CORAA NURC-SP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech. Proc. IberSPEECH 2022, 161-165, doi: 10.21437/IberSPEECH.2022-33
Candido Junior, A., Casanova, E., Soares, A. et al. CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese. Lang Resources & Evaluation (2022). https://doi.org/10.1007/s10579-022-09621-4
GRIS, L. R. S. ; CANDIDO JUNIOR, A. ; SANTOS, V. G. ; DIAS, B. A. P. ; LEITE, M. Q. ; SVARTMAN, F. R. F. ; ALUISIO, SANDRA . Bringing NURC/SP to Digital Life: the Role of Open-source Automatic Speech Recognition Models. In: XIX ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 2022, Campinas/SP. Anais do XIX Encontro Nacional de Inteligência Artificial e Computacional. Porto Alegre, Brasil: SBC, 2022. p. 330-341. https://sol.sbc.org.br/index.php/eniac/article/view/22793/22616
Casanova, E., Shulby, C., Korolev, A., Junior, A.C., Soares, A.d.S., Aluísio, S., Ponti, M.A. (2023) ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion. Proc. INTERSPEECH 2023, 1244-1248, doi: 10.21437/Interspeech.2023-496
Mendes da Silva, A.C., Silva, D.F. and Marcacini, R.M., 2022, December. Heterogeneous Graph Neural Network for Music Emotion Recognition. In 23rd International Society for Music Information Retrieval Conference (ISMIR 2022). https://archives.ismir.net/ismir2022/paper/000080.pdf
Moraes, Leonardo, Ricardo Marcondes Marcacini, and Rudinei Goularte. "Video Summarization using Text Subjectivity Classification." In Proceedings of the Brazilian Symposium on Multimedia and the Web, pp. 133-141. 2022. https://doi.org/10.1145/3539637.3556998
Toledo, G.L. and Marcacini, R.M., 2022. Transfer Learning with Joint Fine-Tuning for Multimodal Sentiment Analysis. In LatinX Workshop at The Thirty-ninth International Conference on Machine Learning (LatinX @ICML 2022). Extended Abstract Paper. https://www.youtube.com/watch?v=6iVm8jl27xI
Rodrigues A. C., Marcacini R. M. Sentence Similarity Recognition in Portuguese from Multiple Embedding Models. In 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA) 2022 Dec 12 (pp. 154-159). IEEE. https://doi.org/10.1109/ICMLA55696.2022.00029
Edresson Casanova, Vinicius G. Santos, Flaviane R. Fernandes Svartman, Marli Quadros Leite, Arnaldo Candido Jr., Ricardo M. Marcacini, Solange O. Rezende & Sandra Maria Aluisio. Recursos para o processamento de fala. In: Caseli, H.M.; Nunes, M.G.V. (org.). Processamento de Linguagem Natural: Conceitos, Técnicas e Aplicações em Português. BPLN, 2023. Disponível em: https://brasileiraspln.com/livro-pln/1a-edicao/parte2/cap3/cap3.html
Leal, S.E., Duran, M.S., Scarton, C.E., Aluisio, S.M. NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese. Lang Resources & Evaluation (2023). https://doi.org/10.1007/s10579-023-09693-w
Frederico S. Oliveira, Edresson Casanova, Arnaldo Candido Junior, Anderson S. Soares & Arlindo R. Galvão Filho. CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages. In: Text, Speech, and Dialogue (TSD 2023), 2023, Plzeň, Czechia. Text, Speech, and Dialogue. Cham: Springer Nature Switzerland, 2023. p. 188-199.
Frederico S. Oliveira, Edresson Casanova, Arnaldo Cândido Júnior, Lucas R. S. Gris, Anderson S. Soares, Arlindo R. Galvão Filho. Evaluation of Speech Representations for MOS Prediction. In: Text, Speech, and Dialogue (TSD 2023), 2023, Plzeň, Czechia. Text, Speech, and Dialogue. Cham: Springer Nature Switzerland, 2023. p. 270-282.
TOMITA, Victor Akihito Kamada; DA SILVA, Angelo Cesar Mendes; MARCACINI, Ricardo Marcondes. Cluster Fusion Training: Exploring Cluster Analysis to Enhance Cross-Domain Sentiment Classification. In: Anais do XX Encontro Nacional de Inteligência Artificial e Computacional. SBC, 2023. p. 330-344. Note: third place (best paper session).
MORAES, Marcelo Isaias; MARCACINI, Ricardo Marcondes. On the Use of Aggregation Functions for Semi-Supervised Network Embedding. In: 2023 International Joint Conference on Neural Networks (IJCNN). IEEE, 2023. p. 1-8.
Edresson Casanova, Sandra Aluísio, and Moacir Antonelli Ponti. 2024. TTS applied to the generation of datasets for automatic speech recognition. In Proceedings of the 16th International Conference on Computational Processing of Portuguese (Propor 2024), pages 633–638, Santiago de Compostela, Galicia/Spain. Association for Computational Linguistics (LINK: https://aclanthology.org/2024.propor-1.73).
Giovana Meloni Craveiro, Vinicius Gonçalves Santos, Gabriel Jose Pellisser Dalalana, Flaviane R. Fernandes Svartman, and Sandra Maria Aluísio. 2024. Simple and Fast Automatic Prosodic Segmentation of Brazilian Portuguese Spontaneous Speech. In Proceedings of the 16th International Conference on Computational Processing of Portuguese (PROPOR 2024), pages 32–44, Santiago de Compostela, Galicia/Spain. Association for Computational Linguistics (LINK: https://aclanthology.org/2024.propor-1.4/).
Ana Carolina Rodrigues, Alessandra A. Macedo, Arnaldo Candido Jr, Flaviane R. F. Svartman, Giovana M. Craveiro, Marli Quadros Leite, Sandra M. Aluísio, Vinícius G. Santos, and Vinícius M. Garcia. 2024. Portal NURC-SP: Design, Development, and Speech Processing Corpora Resources to Support the Public Dissemination of Portuguese Spoken Language. In Proceedings of the 16th International Conference on Computational Processing of Portuguese (PROPOR 2024), pages 187–195, Santiago de Compostela, Galicia/Spain. Association for Computational Linguistics (LINK: https://aclanthology.org/2024.propor-1.19).
Ariadne Matos, Gustavo Araújo, Arnaldo Candido Junior, and Moacir Ponti. 2024. Accent Classification is Challenging but Pre-training Helps: a case study with novel Brazilian Portuguese datasets. In Proceedings of the 16th International Conference on Computational Processing of Portuguese (Propor 2024), pages 364–373, Santiago de Compostela, Galicia/Spain. Association for Computational Linguistics (LINK: https://aclanthology.org/2024.propor-1.37).
Giovana Meloni Craveiro and Julio Cesar Galdino. Diversity in Data for Speech Processing in Brazilian Portuguese. Proceedings of The 34th Brazilian Conference on Intelligent Systems (BRACIS 2024).
Rodrigo Lima, Sidney Leal, Arnaldo Candido Jr. and Sandra Maria Aluisio. A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation. Proceedings of the 34th Brazilian Conference on Intelligent Systems (BRACIS 2024). Pre-print version
ARAÚJO, Gustavo E. et al. EyetrackingMOS: Proposta de um método de avaliação online para modelos de síntese de fala. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 15., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 87-96. DOI: https://doi.org/10.5753/stil.2024.245424. Link: https://sol.sbc.org.br/index.php/stil/article/view/31120.
Julio Cesar Galdino, Gustavo Araújo, Miguel Oliveira Jr., Arnaldo Candido Junior, Moacir Ponti, Sandra Aluísio. Acoustic Analysis of Prosodic Features in Natural versus Synthesized Speech Samples from YourTTS and SYNTACC Models. Proceedings of the XXI Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2024) (It will be available soon at: https://sol.sbc.org.br/index.php/eniac/).
Extended Abstracts, TCCs & Technical Reports:
GRIS, L. R. S. ; CANDIDO JUNIOR, A. Reconhecimento de Voz Utilizando Wav2vec 2.0 para o Português Brasileiro. 2021. Trabalho de Conclusão de Curso. (Graduação em Ciência da Computação) - Universidade Tecnológica Federal do Paraná.
GIANESI, Bruno Honorio do Carmo. Classificação de gênero via análise de áudio utilizando métodos de aprendizado de máquina tradicionais. 2021. Trabalho de Conclusão de Curso (Graduação) – Escola de Engenharia de São Carlos, Universidade de São Paulo, São Carlos, 2021. Disponível em: https://repositorio.usp.br/directbitstream/93fced06-d7a6-424d-bd41-fb39d0c175e1/Gianesi_Bruno_tcc.pdf. Acesso em: 20 maio 2022.
GRIS, L. R. S. ; CANDIDO JUNIOR, A. Automatic Spoken Language Identification using Convolutional Neural Networks. In: Latinoware, 2020, Foz do Iguaçu. Anais do Latin Science, 2020 (Extended Abstract).
Peçanha, Nicholas. Adaptação de um método automático de segmentação prosódica baseado em heurísticas para o português do Brasil, usando dados de palestras e aulas do NURC-SP. Trabalho de Conclusão de Curso (Graduação) – Escola de Engenharia de São Carlos, Universidade de São Paulo, São Carlos, 2022.
Velloso, Rafael Meliani. Adaptação de um método automático de segmentação prosódica baseado em heurísticas para o português do Brasil: Avaliação com diálogos e entrevistas do dataset NURC-SP. Trabalho de Conclusão de Curso (Graduação) – ICMC, Universidade de São Paulo, São Carlos, 2022. https://github.com/Rafael-M-V/IC-TaRSila
Submitted:
Lucas Gris, Ricardo Marcacini, Arnaldo Candido Junior, Edresson Casanova, Anderson Soares, Sandra Maria Aluísio. Evaluating OpenAI's Whisper ASR for Punctuation Prediction and Topic Modeling of life histories of the Museum of the Person. Prosodic Interfaces, De Gruyter eds., Miguel Oliveira Jr. org. Pre-print version.
GONZAGA, V. M.; MURRUGARRA-LLERENA, N. ; MARCACINI, R. M. Multimodal intent classification with incomplete modalities using heterogeneous networks. Multimedia Systems (in review).
Sidney Evaldo Leal, Arnaldo Candido Junior, Ricardo Marcacini, Edresson Casanova, Odilon Gonçalves, Anderson Silva Soares, Rodrigo Freitas Lima, Lucas Rafael Stefanel Gris and Sandra Aluísio. MuPe Life Stories Dataset: Spontaneous Speech in Brazilian Portuguese with a Case Study Evaluation on ASR Bias against Speakers Groups and Topic Modeling. Submitted to the 31st International Conference on Computational Linguistics (COLING 2025).
CORAA (CORpus de Áudios Anotados)
A large multi-purpose corpus of Brazilian Portuguese audio files aligned with transcriptions and manually validated, intended for training ASR and TTS models and for sentiment analysis based on acoustic features.
MuPe Life Stories
One of the first applications of the TaRSila project will be the automatic transcription of life stories using Automatic Speech Recognition (ASR). This is the result of a partnership between TaRSila researchers, CEIA/UFG and MuPe, a non-governmental organization. A large number of original MuPe stories are captured in video and audio, so TaRSila ASR is planned to be used to generate transcriptions, which simplifies searching this large story database.
C4AI Collaborators
References
Oliveira Jr., M. (2016). NURC Digital Um protocolo para a digitalização, anotação, arquivamento e disseminação do material do Projeto da Norma Urbana Linguística Culta (NURC). CHIMERA: Revista De Corpus De Lenguas Romances Y Estudios Lingüísticos, 3(2), 149–174. Retrieved from https://revistas.uam.es/chimera/article/view/6519.
MENDES, R.B. (2013) Projeto SP2010: Amostra da fala paulistana. Available at <http://projetosp2010.fflch.usp.br>. Accessed 6 June 2021.
Gonçalves, S. C. L. Projeto ALIP (Amostra Linguística do Interior Paulista) e banco de dados Iboruna: 10 anos de contribuição com a descrição do português brasileiro. ESTUDOS LINGUÍSTICOS (SÃO PAULO. 1978), v. 48, p. 276-297, 2019.
RASO, T.; MELLO, H. The C-ORAL-BRASIL I: Reference Corpus for Informal Spoken Brazilian Portuguese. Lecture Notes on Artificial Intelligence, v. 7243, p. 362-368, 2012.
The TaRSila logo was designed by Paula Marin de Oliveira and Bruno Baldissera Carlotto.
The CORAA logo was designed by Paulo Matheus Silva Oliveira.