CORAA Versions

CORAA (CORpus de Áudios Anotados)

A large multi-purpose corpus of Brazilian Portuguese audio aligned with transcriptions and manually validated, intended for training ASR and TTS models and for sentiment analysis based on acoustic features.

Version 1: CORAA ASR - Academic Corpora Projects (work finished on September 15, 2021)

CORAA ASR version 1.1 is composed of academic corpora projects and a collection of TEDx Talks, all under an academic license (CC BY-NC-ND 4.0 International).

NURC-Recife (OLIVEIRA JR., 2016) contains spontaneous and prepared speech from Recife, PE;
SP2010 (MENDES, 2013) contains spontaneous and read speech from the interior of São Paulo;
ALIP (GONÇALVES, 2019) contains spontaneous speech from the interior of São Paulo;
C-ORAL Brasil I (RASO & MELLO, 2012) contains spontaneous speech from Minas Gerais; and
the TEDx Talks collection contains 72.74 hours of prepared speech taken from TEDx Talks in Portuguese.

ALIP, SP2010, C-ORAL Brasil I, and NURC-Recife currently total 228.05 hours; all the datasets together total 290.79 hours. All of them were manually validated on the BrazSpeechData Platform for training ASR models for Portuguese.

This dataset is available at https://github.com/nilc-nlp/CORAA/ as part of the Shared Task on ASR for spontaneous and prepared speech.

Version 2: CORAA NURC-SP (work finished on December 15, 2023)

NURC-SP contains spontaneous and prepared speech from the city of São Paulo, divided into a part with audio and manual transcriptions (47 inquiries) and a part with audio only (328 inquiries). In the Tarsila project, NURC-SP was divided into three subcorpora: Córpus Mínimo, CATNA (Córpus de Áudios Transcritos não Anotados), and Córpus de Áudios:

Córpus Mínimo. This dataset comprises 21 files (~19 hours) from the part of the NURC-SP corpus that has audio and manual transcriptions. It was automatically aligned using the aeneas forced-alignment toolkit (https://www.readbeyond.it/aeneas/) and manually segmented according to prosodic criteria based on the work of Raso and Mello (2012). It is available in the Portulan Clarin repository under a CC BY-NC-ND 4.0 license: https://hdl.handle.net/21.11129/0000-000F-73CA-C
CATNA (Córpus de Áudios Transcritos não Anotados). This dataset comprises 26 files (~24 hours) of audio with non-aligned transcriptions. Of these, 21 were automatically segmented with a prosodic segmentation method based on Biron et al. (2021) and customized for Brazilian Portuguese (https://github.com/nilc-nlp/ProsSegue), and 5 were automatically aligned using the forced-alignment tool aeneas (https://www.readbeyond.it/aeneas/). Students revised the automatic segmentations, finishing on May 31, 2024.
Córpus de Áudios. This dataset of 328 audio files (~239.30 valid hours) was automatically transcribed by WhisperX, which provides fast transcription using the large-v2 Whisper model and diarization via pyannote-audio. Between 6 and 18 students reviewed the automatic transcripts from June to December 2023.

This dataset is available at the NURC-SP DIGITAL Portal for linguistic analysis: http://tarsila.icmc.usp.br:8080/nurc/home.

The NURC-SP Audio Corpus was also prepared for evaluating the ASR task: https://github.com/nilc-nlp/nurc-sp-audio-corpus
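
ASR evaluation on a corpus like this is commonly reported as word error rate (WER): the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. This page does not say which metric or tooling the project uses, so the following is only a minimal sketch under that assumption. The function name word_error_rate and the sample sentences are hypothetical, and no text normalization (casing, punctuation) is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one reference word ("é") is missing from the output.
print(word_error_rate("o livro é muito bom", "o livro muito bom"))  # 0.2
```

In practice the reference would be the manually reviewed transcript of a segment and the hypothesis the output of the ASR system under test.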

Version 3: CORAA Programa Certas Palavras (digitization completed; work finished on May 31, 2024)

This version covers Programa Certas Palavras Série 4 (from CEDAE, UNICAMP), comprising 133 K7 (cassette) tapes and 6 reel tapes, and Série 1 (also from CEDAE, UNICAMP), comprising 10 reel tapes.

The “Certas Palavras” program was conceived in 1980 on the initiative of journalists Claudiney José Ferreira and Jorge Marques de Vasconcellos and was broadcast on radio as a program about books and ideas. The CORAA Certas Palavras dataset contains ~63 hours of valid spontaneous speech divided into 163 episodes. The episodes were automatically transcribed by WhisperX, which provides fast transcription using the large-v2 Whisper model and diarization via pyannote-audio. Two students revised the automatic transcriptions from November 2023 to March 2024, and 12 students replaced the pyannote-audio diarization labels with the real speaker names from April to May 2024, resulting in a resource for evaluating ASR, TTS, and diarization methods.
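
The relabeling step described above, in which generic diarization labels are replaced with real speaker names, could look roughly like the sketch below. It assumes that segments are dictionaries with start, end, text and speaker keys and that diarization labels have the form SPEAKER_00, SPEAKER_01; the actual file format used by the project is not described on this page, and rename_speakers and the placeholder names are hypothetical.

```python
def rename_speakers(segments, speaker_names):
    """Return a copy of the segments with diarization labels mapped to names.

    Labels missing from the mapping are kept unchanged, so unreviewed
    segments remain easy to spot. The input list is not modified.
    """
    renamed = []
    for segment in segments:
        updated = dict(segment)
        if "speaker" in updated:
            label = updated["speaker"]
            updated["speaker"] = speaker_names.get(label, label)
        renamed.append(updated)
    return renamed


# Hypothetical episode excerpt and reviewer-provided mapping.
episode = [
    {"start": 0.0, "end": 4.2, "text": "Boa noite.", "speaker": "SPEAKER_00"},
    {"start": 4.2, "end": 9.8, "text": "Obrigado pelo convite.", "speaker": "SPEAKER_01"},
]
mapping = {"SPEAKER_00": "HOST_NAME", "SPEAKER_01": "GUEST_NAME"}

for segment in rename_speakers(episode, mapping):
    print(segment["speaker"], segment["text"])
```

Keeping unmapped labels as they are, rather than dropping them, is a design choice of this sketch and not something the page specifies.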

License: CC BY-NC 4.0

https://github.com/GustavoEvangelistaAraujo/CertasPalavras/

Dataset: https://huggingface.co/datasets/nilc-nlp/certas_palavras/

Version 4: CORAA MuPe life-stories (revision of automatic transcriptions completed in April 2024)

289 life stories from the MuPe collection (365 valid hours) were processed and anonymized to be part of our corpus.
The life stories were automatically transcribed by WhisperX, which provides fast transcription using the large-v2 Whisper model and diarization via pyannote-audio. Ten students revised the automatic transcriptions from June 2023 to April 2024. The pyannote-audio diarization labels were evaluated in May 2024 by two trained students on the life story available at https://www.youtube.com/watch?v=ctNiVFxcep0#. WhisperX divided this life story into 1,146 segments, and Cohen's Kappa for the 2 raters was 0.947, considered almost perfect (Landis & Koch, 1977). We therefore consider the automatic diarization labels suitable for training ASR and TTS models.
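
For reference, Cohen's Kappa compares the observed agreement p_o between two raters with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it in plain Python and maps the result to the Landis & Koch (1977) bands. The label values and the toy ratings are hypothetical, since this page does not describe the annotation scheme the two students used; only the 0.947 value comes from the evaluation above.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; treated as full agreement here.
        return 1.0
    return (observed - expected) / (1 - expected)


def landis_koch_band(kappa):
    """Verbal interpretation of Kappa following Landis & Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, name in ((0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return name
    return "almost perfect"


# Toy ratings for four segments (hypothetical labels).
student_1 = ["ok", "ok", "wrong", "ok"]
student_2 = ["ok", "wrong", "wrong", "ok"]
print(cohen_kappa(student_1, student_2))  # 0.5
print(landis_koch_band(0.947))            # almost perfect
```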

Version 5: CORAA SOFIA-FALA (ongoing work)

Corpus based on the Sofia Fala project (MELONI et al., 2021; SOUZA et al., 2019; STELLA et al., 2019; SOUZA et al., 2018; ANDRADE et al., 2000; WERTZNER, 2000):

Impaired-speech corpus, currently in the capture phase
Website for donating 10 audio recordings per person with a speech disorder, captured by reading 10 sentences: https://dcm.ffclrp.usp.br/coleta/html/inicio.php
Rules and ethical aspects for donation following the LGPD: TCLE and TCUISV specific to the donation
To be used in a shared task on ASR for impaired speech, to advance methods and applications that help with speech delay, phonological disorder, phonetic disorder, childhood apraxia of speech, and dysarthria, whether or not associated with a diagnosis of Down syndrome, autism spectrum disorder, or intellectual deficit.

Version 6: CORAA SER Sentiment Analysis Dataset (work finished on September 15, 2021)

CORAA SER version 1.0 is composed of approximately 50 minutes of audio segments labeled in three classes: neutral, non-neutral female, and non-neutral male. While the neutral class represents audio segments with no well-defined emotional state, the non-neutral classes represent segments associated with one of the primary emotional states in the speaker's speech.

This dataset was built from the C-ORAL-BRASIL I corpus (Raso and Mello, 2012). The available corpus consists of audio segments representing informal spontaneous speech in Brazilian Portuguese. The non-neutral emotion classes were labeled considering paralinguistic elements (laughing, crying, etc.).

This dataset is available at https://github.com/rmarcacini/ser-coraa-pt-br/ as part of the Shared Task on Speech Emotion Recognition with Acoustic Aspects.

References

ANDRADE, C.R.F.; BÉFI-LOPES, D.M.; FERNANDES, F.D.M.; WERTZNER, H.F. ABFW: Language Test in the areas of Phonology, Vocabulary, Fluency and Pragmatics. Carapicuiba (SP): Pró-Fono, 2000. 90 p.

Biron T, Baum D, Freche D, Matalon N, Ehrmann N, et al. (2021) Automatic detection of prosodic boundaries in spontaneous speech. PLOS ONE 16(5): e0250969. https://doi.org/10.1371/journal.pone.0250969.

Gonçalves, S. C. L. Projeto ALIP (Amostra Linguística do Interior Paulista) e banco de dados Iboruna: 10 anos de contribuição com a descrição do português brasileiro. ESTUDOS LINGUÍSTICOS (SÃO PAULO. 1978), v. 48, p. 276-297, 2019.

Landis, J. R. & Koch, G. G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), 159-174.

MELONI, FERNANDO ; SICCHIERI, BIANCA ; MANDRA, PATRICIA ; BULCAO-NETO, RENATO ; MACEDO, ALESSANDRA ALANIZ . A Nonverbal Recognition Method to Assist Speech. In: 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), IEEE Computer Society, 2021. v. 1. p. 360-365.

MENDES, R.B. (2013) Projeto SP2010: Amostra da fala paulistana. Available at <http://projetosp2010.fflch.usp.br>. Accessed June 6, 2021.

Oliveira Jr., M. (2016). NURC Digital Um protocolo para a digitalização, anotação, arquivamento e disseminação do material do Projeto da Norma Urbana Linguística Culta (NURC). CHIMERA: Revista De Corpus De Lenguas Romances Y Estudios Lingüísticos, 3(2), 149–174. Retrieved from https://revistas.uam.es/chimera/article/view/6519.

RASO, T. ; MELLO, H. The C-ORAL-BRASIL I: Reference Corpus for Informal Spoken Brazilian Portuguese. Lecture Notes on Artificial Intelligence, v. 7243, p. 362-368, 2012.

SOUZA, F. C. M. ; SOUZA, A. C. C. ; WATANABE, C. ; MANDRA, P. P. ; MACEDO, A. A. . An Analysis of Visual Speech Features for Recognition of Non-articulatory Sounds using Machine Learning. INTERNATIONAL JOURNAL OF COMPUTER APPLICATIONS, v. 177, p. 1-9, 2019.

SOUZA, F. C. M. ; SOUZA, A. C. C. ; NAKAMURA, G. M. ; SOARES, M. D. ; MANDRA, P. P. ; MACEDO, A. A. . Investigating Non-Articulatory Sound Recognition with Statistical Tests and Support Vector Machine. In: 15th International Conference on Information Technology, 2018.

STELLA, R. ; JUNQUEIRA, V. ; ANDRADE, M. C. N. B. ; SOARES, M. D. ; MACEDO, A. A. . Aplicativo desenvolvido na USP ajuda a treinar fala de crianças com Down. Jornal da USP, 12 fev. 2019.

WERTZNER, H.F. Phonology. In: ANDRADE, C. R. F.; BEFI-LOPES, D.M.; FERNANDES, F.D.M.; WERTZNER, HF ABFW: Child Language Test in the areas of Phonology, Vocabulary, Fluency and Pragmatics. São Paulo: Pró-Fono, 2000. cap. 1, p. 5-40.
