Automatic Speech Recognition for spontaneous and prepared speech &
Speech Emotion Recognition in Portuguese

March 21, 2022

Fortaleza, Brazil



Collocated with PROPOR 2022

One-day workshop collocated with the 15th edition of the International Conference on the Computational Processing of Portuguese (PROPOR 2022), comprising two shared tasks on the topic of Speech Processing in Portuguese.


The workshop aims to bring together new researchers and enthusiasts of speech processing in Portuguese, graduate students, industry professionals, computational linguists, and artificial intelligence researchers from our community.


Update: Speech Emotion Recognition - Task Results (Final Ranking)

Presentation

Automatic Speech Recognition (ASR) is an active research field with important applications such as personal digital assistants, automatic call-based services, and automatic video subtitling, among others. Recent advances in the field have shown that the quality of ASR systems is beginning to approach human capabilities. However, a state-of-the-art ASR system requires large amounts of training data and computational resources. In particular, there is a scarcity of resources for the Portuguese language, including publicly available datasets and models.

Until mid-2020, the Brazilian Portuguese (BP) language had only a few dozen hours of public data available for training and evaluating ASR and speech synthesis (TTS) systems. The previously available open datasets for training speech models in BP were much smaller than American English datasets, with only 10 hours for TTS and 60 hours for ASR. In the second half of 2020, three new datasets were made available (Alencar and Alcaim, 2008; Pratap et al., 2020; Ardila et al., 2020), resulting in 376 hours of non-conversational audio. In 2021, new releases (the Multilingual TEDx Corpus (Salesky et al., 2021) and a new version of the Common Voice dataset, version 7.0) increased ASR resources to 574 hours.


However, there is still a lack of datasets with audio that records spontaneous speech of various genres, from interviews to informal dialogues and conversations, i.e., conversational speech recorded in natural contexts and noisy environments, which is needed to train robust ASR systems. Spontaneous speech presents several phenomena, such as laughter, coughs, filled pauses, and word fragments arising from repetitions, restarts, and revisions of the discourse. This gap makes it difficult to develop both high-quality dialogue systems and ASR systems capable of handling spontaneous speech recorded in noisy environments, since such speech is harder to process.


Moreover, ASR is not the only research area to benefit from the availability of public benchmarks.


Speech Emotion Recognition (SER) is a multidisciplinary area of study that has received much attention over the last decade. Recognizing a speaker's emotional state from their speech is helpful for many applications, such as diagnostic tools for therapists, improving voice assistants, and analyzing communications in call centers (Schuller, 2018; Akçay and Oğuz, 2020).


The Automatic Speech Recognition for Spontaneous and Prepared Speech & Speech Emotion Recognition in Portuguese (SE&R 2022) Workshop introduces two versions of a new dataset called CORAA (Corpus of Annotated Audios - https://sites.google.com/viw/tarsila-c4ai), built in the TaRSila project, an effort of the Center for Artificial Intelligence (C4AI - http://c4ai.inova.usp.br/pt/nlp2-pt/), as the basis for the first edition of a new series of shared tasks for the Portuguese language.


In this first edition, we propose two shared tasks: ASR for Spontaneous and Prepared Speech, and Speech Emotion Recognition.

Paper Submission & Presentation

Call For Papers

Submissions should be system description papers related to the chosen shared task. Papers may consist of 4 to 8 pages of content plus 2 pages for references, formatted using the CEUR template. Papers do not need to be anonymized.

Upon acceptance, all papers will be given two additional content pages to address reviewers’ comments. All papers should be submitted via EasyChair.

Accepted papers will be published as CEUR Workshop Proceedings (CEUR-WS.org) and will be presented at the SE&R 2022 Workshop either orally or as a poster. To have a paper published, it is necessary to register for the PROPOR conference and commit to presenting the work.

Language of the event

English


Paper Presentation Instructions:

(1) All paper presentations will be 20 minutes long, followed by a 5-minute Q&A session with the audience. Your presentation can be given in Portuguese, but your slides should be written in English.


(2) Send the PDF slides of your presentation, no later than 21st March at 10:00 AM, to:


Ricardo Marcacini <ricardo.marcacini@usp.br> and

Arnaldo Candido Junior <arnaldocan@gmail.com>


(3) On the day of the SE&R 2022 Workshop, you will present the slides from your machine. Have a PDF copy of the slides handy.


(4) Enter the Zoom session 15 minutes before the Workshop starts.

Invited Speaker

Arnaldo Candido Jr.

Federal University of Technology - Paraná (UTFPR), Brazil

Arnaldo Candido Junior holds a PhD in Computer Science and Computational Mathematics and currently works as a professor at the Federal University of Technology - Paraná. His research spans Artificial Intelligence and Natural Language Processing, with a focus on audio processing (recognition, synthesis, speaker identification, voice pathologies), deep neural networks, and machine learning. Arnaldo participates in audio processing projects such as TaRSila (compilation of audio resources and creation of open models for the Portuguese language) and SPIRA (COVID-19 detection using voice). He also collaborates with Brazilian AI groups such as C4AI (Center for Artificial Intelligence) and CEIA (Center of Excellence in Artificial Intelligence).

References


V. F. S. Alencar and A. Alcaim. LSF and LPC-derived features for large vocabulary distributed continuous speech recognition in Brazilian Portuguese. In 2008 42nd Asilomar Conference on Signals, Systems and Computers, pages 1237–1241, 2008. doi: 10.1109/ACSSC.2008.5074614.


Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. MLS: A large-scale multilingual dataset for speech research. Interspeech 2020, Oct 2020. doi: 10.21437/Interspeech.2020-2826.


Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers and Gregor Weber. Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France, May 2020. European Language Resources Association. https://aclanthology.org/2020.lrec-1.520


Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W Oard, and Matt Post. The multilingual TEDx corpus for speech recognition and translation, 2021.


Björn W Schuller. Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends. Communications of the ACM, 61(5):90–99, 2018. https://doi.org/10.1145/3129340.


Mehmet Berkehan Akçay and Kaya Oğuz. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication, 116:56–76, 2020. https://doi.org/10.1016/j.specom.2019.12.001.