Shared-tasks

Shared-Task on ASR for spontaneous and prepared speech

Task Description:

This challenge aims to motivate research on automatic speech recognition (ASR), to increase the number of ASR models available for Portuguese, and to encourage young researchers to experiment with the resources of this exciting research area.

DATA:

The CORAA ASR dataset version 1.1 is composed of 290.79 hours of transcribed audio from spontaneous and prepared speech. It includes European Portuguese (2.68 hours) and Brazilian Portuguese (the remaining hours). The dataset covers four main Brazilian accents (São Paulo State Cities, Minas Gerais, Recife, São Paulo Capital) but also includes speakers from many other regions of Brazil.

Training and development sets are released for this shared-task, and the models generated by the participants will be evaluated using a test set, which will be publicly available after the challenge.

Suggested Additional Data for Submissions:

In the proposed task, participants train their own models on the resources made available specifically for the challenge and may also use other open resources, such as:

CETUC

Common Voice

Multilingual LibriSpeech

Multilingual TEDx

These are only suggestions; any publicly available additional data or pretrained models are permitted. Submitted models can be trained in a closed-data setting (using only the provided data, e.g., no models pretrained on external data) or in an open-data setting. Participants will be asked to state whether their models use open or closed data.

Baseline:

We provide a strong baseline consisting of a pre-trained version of the Wav2Vec 2.0 model. More specifically, this pre-trained model is based on Wav2Vec 2.0 XLSR-53. The baseline is available as a Hugging Face model. We strongly recommend that participants use this model for transfer learning.

Evaluation:

Each participant can submit up to four models and can assign these models to four subtasks:

- Mixed (all datasets)
- Prepared Speech PT_BR (TEDx Portuguese)
- Prepared Speech PT_PT (TEDx Portuguese)
- Spontaneous Speech (ALIP, C-Oral Brasil, SP2010, NURC-RE*)
* (also contains prepared speech)

The results will be released per subtask, ranked by Character Error Rate (CER); Word Error Rate (WER) will also be reported. In particular, results for models participating in the Spontaneous Speech subtask will be reported by accent.
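As a rough illustration of the two metrics, the sketch below computes CER and WER as the total edit distance divided by the total number of reference units. It is not the organizers' scoring script: the function names are hypothetical, spaces are counted as characters, no text normalization (casing, punctuation) is applied, and rates are aggregated over the whole set rather than averaged per utterance. All of these choices are assumptions.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def tokenize(text: str, unit: str) -> List[str]:
    """Split a transcript into characters (CER) or whitespace words (WER)."""
    if unit == "char":
        return list(text)  # assumption: spaces count as characters
    if unit == "word":
        return text.split()
    raise ValueError("unit must be 'char' or 'word'")


def error_rate(refs: Sequence[str], hyps: Sequence[str], unit: str) -> float:
    """Corpus-level error rate: total edits / total reference units."""
    edits = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = tokenize(ref, unit), tokenize(hyp, unit)
        edits += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("references contain no units")
    return edits / total
```

For example, with the reference "o gato dorme" and the hypothesis "o rato dorme", `error_rate(..., "word")` gives 1/3 (one wrong word out of three) and `error_rate(..., "char")` gives 1/12 (one wrong character out of twelve, spaces included). Per-accent reporting could be obtained by calling the same function on each accent's subset of utterances.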

Speech Emotion Recognition

Task Description:

Here, we present the Brazilian Portuguese Speech Emotion Recognition (SER) Task. This task aims to motivate SER research in our community, mainly by discussing theoretical and practical aspects of SER, pre-processing and feature extraction, and machine learning models for Portuguese.

DATA:

We provide a dataset called CORAA SER version 1.0, composed of approximately 50 minutes of audio segments labeled with one of three classes: neutral, non-neutral female, and non-neutral male. While the neutral class represents audio segments with no well-defined emotional state, the non-neutral classes represent segments associated with one of the primary emotional states in the speaker's speech. This dataset was built from the C-ORAL-BRASIL I corpus (Raso and Mello, 2012). The available corpus consists of audio segments representing informal spontaneous speech in Brazilian Portuguese. The non-neutral classes were labeled considering paralinguistic elements (laughing, crying, etc.). Participants can use pre-trained models and external data, as long as the original C-ORAL-BRASIL corpus (or variants) is not used for model training.

In this task, participants must train their own models using acoustic audio features. A training set is available. The models trained by the participants will be evaluated on a test set, which will be made publicly available after the challenge.

Baselines:

We provide two baseline models. The first baseline uses a set of prosodic audio features for emotion classification. The second baseline uses the Wav2Vec model to extract features (i.e., embeddings) from the audio segments. These features can then be used to train a speech emotion recognition classifier.

Evaluation:

Each participant can submit up to three models. The Macro F1 score will be used to evaluate the models.
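To make the metric concrete, here is a minimal sketch of Macro F1: the F1 score is computed separately for each class and the per-class scores are averaged with equal weight, so rare classes count as much as frequent ones. The function name and the label strings are hypothetical, and scoring a class with no true or predicted examples as 0 is an assumed convention; the official evaluation script may differ.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true or predicted examples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical label names for the three classes of the task.
LABELS = ["neutral", "non-neutral-female", "non-neutral-male"]

y_true = ["neutral", "neutral", "non-neutral-female", "non-neutral-male"]
y_pred = ["neutral", "non-neutral-female", "non-neutral-female", "non-neutral-male"]
score = macro_f1(y_true, y_pred, LABELS)
```

In this toy example the per-class F1 scores are 2/3, 2/3, and 1, so the Macro F1 is 7/9 (about 0.78).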

References

(Raso and Mello, 2012) Tommaso Raso and Heliana Mello. The C-ORAL-BRASIL I: reference corpus for informal spoken Brazilian Portuguese. In International Conference on Computational Processing of the Portuguese Language, pages 362–367. Springer, 2012.

Registration

Registration for SE&R 2022 Shared-tasks on ASR for Spontaneous and Prepared Speech and Speech Emotion Recognition is now open: Registration Form.

Please remember that, to publish the paper about your models, you must register for the Propor 2022 conference and commit to presenting your work! See more details on Paper Submission.

Submission

Submissions for SE&R 2022 Shared-tasks on ASR for Spontaneous and Prepared Speech and Speech Emotion Recognition are now open. Please check the Submission Page for more details.