CORAA Versions

CORAA (CORpus de Áudios Anotados)

CORAA is a large multi-purpose corpus of Brazilian Portuguese audio files aligned with transcriptions and manually validated. It is intended for training ASR and TTS models, and for sentiment analysis based on acoustic features.

Version 1: CORAA ASR - Academic Corpora Projects (finished work - September 15, 2021)

CORAA ASR version 1.1 is composed of academic corpora projects and a collection of TED talks, all of them under an academic license (CC BY-NC-ND 4.0 International).

ALIP, SP2010, C-ORAL Brasil, and NURC-RECIFE currently total 228.05 hours; all the datasets together total 290.79 hours. They were manually validated for training ASR models for Portuguese, using the BrazSpeechData Platform.

This dataset is available at https://github.com/nilc-nlp/CORAA/ as part of the Shared Task on ASR for spontaneous and prepared speech.

Version 2: CORAA NURC-SP (finished work - December 15, 2023)

NURC-SP contains 334 hours of spontaneous and prepared speech recorded in the city of São Paulo. It is divided into a part with audio and manual transcriptions (47 inquiries) and a part with audio only (328 inquiries). In the Tarsila project, NURC-SP was divided into three subcorpora: Córpus Mínimo, CATNA (Córpus de Áudios Transcritos não Anotados), and Córpus de Áudios.

This dataset is available at the NURC-SP DIGITAL Portal: http://tarsila.icmc.usp.br:8080/nurc/home.

Version 3: CORAA Programa Certas Palavras (digitization completed, work finished on May 31, 2024)


The “Certas Palavras” program was conceived in 1980 by journalists Claudiney José Ferreira and Jorge Marques de Vasconcellos as a radio program about books and ideas. The CORAA Certas Palavras dataset contains approximately 63 hours of valid spontaneous speech divided into 163 episodes. The episodes were automatically transcribed with WhisperX, which provides fast transcription using Whisper's large-v2 model and diarization via pyannote-audio. Two students revised the automatic transcriptions from November 2023 to March 2024, and 12 students replaced the pyannote-audio diarization labels with the real speaker names from April to May 2024, resulting in a resource for evaluating TTS and diarization methods.
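The speaker-name revision step can be pictured with a minimal Python sketch. It is only an illustration, not the project's actual tooling: the segment layout (start, end, label, text), the "SPEAKER_NN" label format, the sample text, and the names "Host" and "Guest" are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical diarized segments: (start_s, end_s, label, text).
# The "SPEAKER_NN" label format is an assumption about the diarizer output.
segments = [
    (0.0, 4.0, "SPEAKER_00", "Boa noite, ouvintes."),
    (4.0, 9.5, "SPEAKER_01", "Obrigado pelo convite."),
    (9.5, 10.0, "SPEAKER_02", "(risos)"),
    (10.0, 12.0, "SPEAKER_00", "Vamos falar de livros."),
]

# Hypothetical mapping written by a human reviewer for one episode.
speaker_names = {"SPEAKER_00": "Host", "SPEAKER_01": "Guest"}


def rename_speakers(segments, names):
    """Replace generic labels with real names; unmapped labels are kept as-is."""
    return [(start, end, names.get(label, label), text)
            for start, end, label, text in segments]


def speaking_time(segments):
    """Sum the duration in seconds of the segments attributed to each speaker."""
    totals = Counter()
    for start, end, speaker, _text in segments:
        totals[speaker] += end - start
    return dict(totals)


renamed = rename_speakers(segments, speaker_names)
for start, end, speaker, text in renamed:
    print(f"[{start:6.1f}-{end:6.1f}] {speaker}: {text}")
print(speaking_time(renamed))
```

Labels without an entry in the mapping are left untouched, so segments that still need review remain easy to spot.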


Version 4: CORAA MuPe life-stories (automatic transcription revision work completed in April 2024)

Version 5: CORAA SOFIA-FALA (ongoing work)

Corpus based on the Sofia Fala project (Meloni et al., 2021; Souza et al., 2019; Stella et al., 2019; Souza et al., 2018; Andrade et al., 2000; Wertzner, 2000).

Version 6: CORAA SER Sentiment Analysis Dataset (finished work - September 15, 2021)

CORAA SER version 1.0 is composed of approximately 50 minutes of audio segments labeled in three classes: neutral, non-neutral female, and non-neutral male. While the neutral class represents audio segments with no well-defined emotional state, the non-neutral classes represent segments associated with one of the primary emotional states in the speaker's speech. 

This dataset was built from the C-ORAL-BRASIL I corpus (Raso and Mello, 2012). The available corpus consists of audio segments representing Brazilian Portuguese informal spontaneous speech. The non-neutral emotion classes were labeled considering paralinguistic elements (laughing, crying, etc.).

This dataset is available at https://github.com/rmarcacini/ser-coraa-pt-br/ as part of the Shared Task on Speech Emotion Recognition with Acoustic Aspects.

References

ANDRADE, C.R.F.; BÉFI-LOPES, D.M.; FERNANDES, F.D.M.; WERTZNER, H.F. ABFW: Language Test in the areas of Phonology, Vocabulary, Fluency and Pragmatics. Carapicuíba (SP): Pró-Fono, 2000. 90 p.

Biron T, Baum D, Freche D, Matalon N, Ehrmann N, et al. (2021) Automatic detection of prosodic boundaries in spontaneous speech. PLOS ONE 16(5): e0250969. https://doi.org/10.1371/journal.pone.0250969.

Gonçalves, S. C. L. Projeto ALIP (Amostra Linguística do Interior Paulista) e banco de dados Iboruna: 10 anos de contribuição com a descrição do português brasileiro. Estudos Linguísticos (São Paulo. 1978), v. 48, p. 276-297, 2019.

MELONI, FERNANDO; SICCHIERI, BIANCA; MANDRA, PATRICIA; BULCAO-NETO, RENATO; MACEDO, ALESSANDRA ALANIZ. A Nonverbal Recognition Method to Assist Speech. In: 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), IEEE Computer Society, 2021. v. 1. p. 360-365.

MENDES, R.B. (2013) Projeto SP2010: Amostra da fala paulistana. Available at <http://projetosp2010.fflch.usp.br>. Accessed June 6, 2021.

Oliveira Jr., M. (2016). NURC Digital Um protocolo para a digitalização, anotação, arquivamento e disseminação do material do Projeto da Norma Urbana Linguística Culta (NURC). CHIMERA: Revista De Corpus De Lenguas Romances Y Estudios Lingüísticos, 3(2), 149–174. Retrieved from https://revistas.uam.es/chimera/article/view/6519.

RASO, T.; MELLO, H. The C-ORAL-BRASIL I: Reference Corpus for Informal Spoken Brazilian Portuguese. Lecture Notes in Artificial Intelligence, v. 7243, p. 362-368, 2012.

SOUZA, F. C. M.; SOUZA, A. C. C.; WATANABE, C.; MANDRA, P. P.; MACEDO, A. A. An Analysis of Visual Speech Features for Recognition of Non-articulatory Sounds using Machine Learning. International Journal of Computer Applications, v. 177, p. 1-9, 2019.

SOUZA, F. C. M.; SOUZA, A. C. C.; NAKAMURA, G. M.; SOARES, M. D.; MANDRA, P. P.; MACEDO, A. A. Investigating Non-Articulatory Sound Recognition with Statistical Tests and Support Vector Machine. In: 15th International Conference on Information Technology, 2018.

STELLA, R.; JUNQUEIRA, V.; ANDRADE, M. C. N. B.; SOARES, M. D.; MACEDO, A. A. Aplicativo desenvolvido na USP ajuda a treinar fala de crianças com Down. Jornal da USP, 12 fev. 2019.

WERTZNER, H.F. Phonology. In: ANDRADE, C.R.F.; BÉFI-LOPES, D.M.; FERNANDES, F.D.M.; WERTZNER, H.F. ABFW: Child Language Test in the areas of Phonology, Vocabulary, Fluency and Pragmatics. São Paulo: Pró-Fono, 2000. cap. 1, p. 5-40.