Multimodal Emotion Recognition Competition 2020 (MERC 2020)

Challenge Overview

1. Background

The recent development of artificial intelligence (AI) has substantially improved performance in speech and image recognition. It is now becoming possible for AI to implement higher cognitive functions such as intention, emotion, personality, trust, situational awareness, and decision making.

As a first step toward these higher cognitive functions, we are interested in the recognition of human emotion based on audio, image, and language. Emotion recognition will become increasingly important as a core technology for developing intelligent human–computer interaction.

2. Objective

This competition aims at developing state-of-the-art models for emotion recognition from emotion-labelled videos containing facial, speech, and text data. Participants are expected to achieve the highest possible accuracy in recognizing one out of 7 classes, i.e., neutral plus 6 emotions.

All participants will be given multimodal emotional videos. In each video, a Korean actor shows facial and vocal expressions of a targeted emotion while speaking a given script. Each video carries three emotion labels: one for the facial video, another for the audio, and the other for the integrated emotion. All these emotion labels are tagged by human evaluators. Participants are expected to submit their multimodal integrated recognition results for the test datasets.

3. Schedule

September 11th, 2020

  • Open baseline code and three datasets, i.e., the train, validation, and the first test datasets.

November 6th, 2020

  • Open the second test dataset.

November 23rd, 2020

  • Open the third test dataset.

  • Final submission deadline.

November 27th, 2020

  • Winner Talks

4. Datasets

4.1. How to download datasets

  1. Fill out this agreement.

  2. Send your agreement with your Eval-AI ID and team information to merc.kaist@gmail.com.

  3. When your agreement is confirmed, a private download link will be sent to your email address.

4.2. About the datasets

  • Each sample in the datasets consists of multimodal data: a facial video, speech, and a text sequence.

  • To provide a fair competition among participants, words are given as word2vec embedding vectors; therefore, the datasets do not contain any information in the form of human-readable text.

5. Baseline model

The following repositories implement a baseline model, which participants may use. The organizer offers two repositories: one contains a multimodal emotion recognition model, and the other contains a speech recognition model. The multimodal model does not include the speech model itself but uses the speech model's outputs.

To receive an award, a participant's model must outperform the baseline model. The performance of the baseline model on the three test sets is as follows.

  • (Accuracy of the first test set) = 0.376819

  • (Accuracy of the second test set) = 0.370509

  • (Accuracy of the third test set) = 0.370231

6. Evaluation

You must submit your prediction results on the three test datasets (Tests 1, 2, and 3) to the MERC2020 EvalAI page. Each test dataset has its own evaluation period, which is described on the EvalAI page.

6.1. Submission files for evaluation

  • Submissions are evaluated by classification accuracy on all three test datasets.

  • Submission files should contain two header columns named 'FileID' and 'Emotion'.

  • In the 'FileID' column, write the 5-digit string file IDs. In the 'Emotion' column, write the 3-character emotion codes (hap, sad, ang, sur, dis, fea, neu).

  • See the sample.csv file for details; a minimal writing sketch follows this list.
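For example, the following minimal Python sketch writes a submission file with the required 'FileID' and 'Emotion' columns; the prediction values shown here are hypothetical:

import csv

# Hypothetical predictions: 5-digit FileID string -> 3-character emotion code
predictions = {'08342': 'hap', '51377': 'neu'}

with open('submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['FileID', 'Emotion'])          # required headers
    for file_id, emotion in sorted(predictions.items()):
        writer.writerow([file_id, emotion])         # e.g. "08342,hap"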

6.2. Evaluation of final performance

The rank of each participating team will be based on a weighted average of its accuracies on the three test datasets:

FinalScore = 0.3 * (Accuracy of the first test set)
           + 0.3 * (Accuracy of the second test set)
           + 0.4 * (Accuracy of the third test set)
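As an illustration, plugging the baseline accuracies from Section 5 into this formula gives the baseline's final score:

# Weighted final score, computed with the baseline accuracies from Section 5
acc1, acc2, acc3 = 0.376819, 0.370509, 0.370231
final_score = 0.3 * acc1 + 0.3 * acc2 + 0.4 * acc3
print(final_score)  # ~0.3723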

7. Prize

The rank of participating teams will be determined by the order of their final scores.

  • 1st ranked: $2,000

  • 2nd ranked: $1,000

  • 3rd ranked: $500

7.1. Awardee Qualifications

Only teams that meet the following qualifications are eligible for the awards. If no team meets the qualifications, there may be no awardee. Personal information (passport number, account number, etc.) will be requested in order to grant the awards.

  • Their models must outperform the baseline model in terms of the final score.

  • They must submit their source code online so that the organizers can verify that the submitted results were generated by that code. (This verification is required only for award candidates.)

8. Awardees

9. Winner Talks

1st Place Winner Talk by Dmitrii Tsybulevskii (Team: u1234x1234)

Presentation file link

QnA

Q1. On Slide 11, I was not able to understand "the first 70% of the pretrained speaker recognition models." Particularly, what does "the first 70%" mean?

A1. The neural network has a linear structure (i.e., a sequence of layers: Conv2d, BatchNorm, Linear, ...). During training, the first 70% of the pretrained weights were frozen, meaning these weights are NOT updated at all. In PyTorch terms: requires_grad=False.

https://discuss.pytorch.org/t/how-the-pytorch-freeze-network-in-some-layers-only-the-rest-of-the-training/7088/2 Example: suppose you have a network with 10 trainable layers. If you freeze the first 70%, the weights of the first 7 layers will not be updated at all, but the weights of the last 3 layers will be learned as usual. This technique reduces the effective capacity of the model, making it harder to overfit on small datasets.
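A minimal PyTorch sketch of this freezing scheme, using the hypothetical 10-layer network from the example above (the actual speaker recognition architecture is not reproduced here):

import torch
import torch.nn as nn

# Hypothetical backbone with 10 trainable layers
backbone = nn.Sequential(*[nn.Linear(64, 64) for _ in range(10)])

layers = list(backbone.children())
n_frozen = int(len(layers) * 0.7)    # freeze the first 70% of the layers

for layer in layers[:n_frozen]:
    for p in layer.parameters():
        p.requires_grad = False      # frozen weights receive no gradient updates

# Only the still-trainable parameters are passed to the optimizer
optimizer = torch.optim.Adam(
    (p for p in backbone.parameters() if p.requires_grad), lr=1e-4)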

Q2. On Slide 11, how did you decide to use "70%" rather than other percentages? Is there any literature behind the number "70"?

A2. I tried 0% (no freezing) and a few other thresholds: 30, 50, and 70. 70% worked best on this particular task (best validation score); there is no magic in this number. It may not be an optimal solution, but it just works. In general it makes sense to try different thresholds (e.g. 30, 50, 70, ...) and select the one that works best empirically, by doing some analysis (a/b tests). Unfortunately I'm not aware of any papers with a grounded theoretical justification.

2nd Place Winner Talk by Seongwoong Jo (Team: SeongwoongJo)

Presentation file link

QnA

Q1. On Slide 8, you said you used linear warm-up scheduling. I am just curious about the hyperparameters of that scheduling.

A1. The linear warm-up scheduler increases the learning rate from 0 to the initial lr over the first epoch. This scheduler stabilizes training with Adam-based optimizers. (There is an instability issue with Adam-based optimizers, and the warm-up scheduler is one of the solutions.)
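A minimal sketch of such a warm-up, assuming PyTorch's LambdaLR and a hypothetical number of batches per epoch (the speaker's exact implementation is not shown here):

import torch

model = torch.nn.Linear(10, 7)                        # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

steps_per_epoch = 100                                 # hypothetical batches per epoch
warmup_steps = steps_per_epoch                        # warm up over the first epoch

# Scale the lr linearly from ~0 up to the initial lr, then keep it constant
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(2 * steps_per_epoch):
    # forward pass and loss.backward() would go here
    optimizer.step()
    scheduler.step()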

Q2. Could you point me to any literature I can dig into to better understand the instability issue with Adam-based optimizers and why the warm-up scheduler is one of the solutions?

A2. https://arxiv.org/pdf/1910.04209.pdf. This is one of the latest papers that explains the instability of adaptive optimizers. The paper discusses several optimizers that try to fix the instability, such as RAdam, but the problem still remains.

Q3. On Slide 7, you said 'multitask loss.' Does it mean the summed loss of the integrated, face, and speech labels? If so, did you enforce different weights on those losses to construct the total training loss?

A3. I used loss = integrate_loss + lambda * (face_loss + speech_loss) / 2 and tried different lambda values: 0.5, 1, and 2. The best lambda was 2, which amounts to a simple summation of each loss.
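A minimal sketch of this multitask loss, assuming a cross-entropy loss per head (the talk does not specify the per-head loss functions):

import torch.nn.functional as F

lam = 2.0   # the best lambda among the tried values {0.5, 1, 2}

def multitask_loss(int_logits, face_logits, speech_logits,
                   int_labels, face_labels, speech_labels):
    integrate_loss = F.cross_entropy(int_logits, int_labels)
    face_loss = F.cross_entropy(face_logits, face_labels)
    speech_loss = F.cross_entropy(speech_logits, speech_labels)
    # loss = integrate_loss + lambda * (face_loss + speech_loss) / 2
    return integrate_loss + lam * (face_loss + speech_loss) / 2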

Q4. Did you try only those three lambda values in the loss?

A4. Yes. I think that if lambda is bigger than 2, face_loss and speech_loss will dominate integrate_loss, so the training will not be able to focus on the integrate loss. Also, I didn't have too much time... so I picked three representative values: 0.5, 1, and 2.

3rd Place Winner Talk by Huynh Van Thong, Soohyung Kim (Team: PRLAIC)

Presentation file link

QnA

Q1. In the training section, what kind of warm-up did you use? Linear?

A1. Yes, we used a linear function to warm up the learning rate.

Q2. In your objective function, you didn't specify what alpha, beta, gamma, and delta are. Could you provide their values? And could you share the final, saturated values of alpha, beta, gamma, and delta?

A2. The values of alpha, beta, gamma, and delta are learnable during training. They are all initialized to 1, which assumes an equal contribution from each loss/modality. The final values for alpha (audio), beta (face), gamma (text), and delta (video) are approximately 0.19, 0.16, 0.22, and 0.36.
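A minimal sketch of one way such learnable per-modality weights could be implemented, initialized to 1 as described; the normalization used here is an assumption, and the team's exact objective may differ:

import torch
import torch.nn as nn

class WeightedModalLoss(nn.Module):
    # Learnable weights: alpha (audio), beta (face), gamma (text), delta (video)
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(4))   # initialized to 1 (equal contribution)

    def forward(self, audio_loss, face_loss, text_loss, video_loss):
        losses = torch.stack([audio_loss, face_loss, text_loss, video_loss])
        w = self.weights / self.weights.sum()        # assumed normalization to keep weights bounded
        return (w * losses).sum()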

Data Description

1. Test Phase 1

  • Size: 70.3 GB

  • In Test Phase 1, three types of datasets are provided: the train, validation, and test 1 datasets.

  • Ratio of the given datasets

    • Train : Val : Test 1 = 8 : 1 : 0.33

  • Participants are allowed to merge the validation dataset into the training dataset.

  • In each dataset, the 7 emotion labels are equally distributed; in other words, the label distribution is balanced.

  • File descriptions in merc2020-1.tgz


merc2020-1
├── train.csv
├── train_face.csv
├── train_speech.csv
├── val.csv
├── val_face.csv
├── val_speech.csv
├── test1
├── train
└── val


  • train.csv

    • A CSV file containing file IDs and their integrated emotion labels for the training dataset

    • Two columns: FileID and Emotion

  • train_face.csv

    • A CSV file containing file IDs and their face emotion labels for the training dataset

    • Two columns: FileID and Emotion

  • train_speech.csv

    • A CSV file containing file IDs and their speech emotion labels for the training dataset

    • Two columns: FileID and Emotion

  • val.csv

    • A CSV file containing file IDs and their integrated emotion labels for the validation dataset

    • Two columns: FileID and Emotion

  • val_face.csv

    • A CSV file containing file IDs and their face emotion labels for the validation dataset

    • Two columns: FileID and Emotion

  • val_speech.csv

    • A CSV file containing file IDs and their speech emotion labels for the validation dataset

    • Two columns: FileID and Emotion

  • test1

    • This test dataset was recorded by actors who do not appear in the training and validation datasets.

    • A directory containing mp4 videos and npz files of the test 1 dataset

    • Name of mp4 videos: {FileID}-{WHRatio}.mp4

      • Example: 08342-3.mp4

      • FileID: a unique identifier attached to each file.

      • WHRatio: indicates the video resolution (width and height).

        • WHRatio=3, then (width, height)=(1280,720).

        • WHRatio=4, then (width, height)=(720,1280).

    • Name of npz files: {FileID}.npz

      • Example: 08342-3.npz

      • FileID: a unique identifier attached to each file.

      • See the 'npz file' section below for how to load this file.

  • train

    • A directory containing mp4 videos and npz files of the training dataset

    • Name of mp4 videos: {FileID}-{WHRatio}-{PersonID}-{gender}-{age}-{utteranceID}-{IntegratedEmotion}-{FaceEmotion}-{SpeechEmotion}.mp4 (a filename-parsing sketch follows this file list)

      • Example: 51377-4-090-w-67-051-neu-dis-neu.mp4

      • FileID: a unique identifier attached to each file.

      • WHRatio: indicates the video resolution (width and height).

        • WHRatio=3, then (width, height)=(1280,720).

        • WHRatio=4, then (width, height)=(720,1280).

      • IntegratedEmotion: Integrated emotion

        • This value can be one of 7 emotion values: 'neu', 'hap', 'ang', 'fea', 'dis', 'sur', 'sad'

      • FaceEmotion: Face emotion

      • SpeechEmotion: Speech emotion

    • Name of npz files: {FileID}.npz

      • Example: 08342-3.npz

      • FileID: a unique identifier attached to each file.

      • See the 'npz file' section below for how to load this file.

  • val

    • A directory containing mp4 videos and npz files of the validation dataset

    • Name of mp4 videos: {FileID}-{WHRatio}-{PersonID}-{gender}-{age}-{utteranceID}-{IntegratedEmotion}-{FaceEmotion}-{SpeechEmotion}.mp4

      • Example: 52215-4-092-m-35-047-neu-neu-sad.mp4

    • Name of npz files: {FileID}.npz

      • Example: 08342-3.npz

      • FileID: a unique identifier attached to each file.

      • See the 'npz file' section below for how to load this file.
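As referenced above, here is a minimal sketch of a hypothetical helper that splits a training/validation video filename into the fields listed in the naming pattern:

FIELDS = ['FileID', 'WHRatio', 'PersonID', 'gender', 'age',
          'utteranceID', 'IntegratedEmotion', 'FaceEmotion', 'SpeechEmotion']

def parse_train_filename(name):
    stem = name.rsplit('.', 1)[0]             # drop the '.mp4' extension
    return dict(zip(FIELDS, stem.split('-')))

parse_train_filename('51377-4-090-w-67-051-neu-dis-neu.mp4')
# {'FileID': '51377', 'WHRatio': '4', 'PersonID': '090', 'gender': 'w', 'age': '67',
#  'utteranceID': '051', 'IntegratedEmotion': 'neu', 'FaceEmotion': 'dis', 'SpeechEmotion': 'neu'}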

2. Test Phase 2

  • Size: 2.69 GB

  • Ratio of the given datasets

    • Train : Val : Test 1 : Test 2 = 8 : 1 : 0.33 : 0.33

  • File descriptions in merc2020-2.tgz

    • A directory containing mp4 videos and npz files of the test 2 dataset

    • This test dataset was recorded by actors who do not appear in the training and validation datasets.

    • Name of mp4 videos: {FileID}-{WHRatio}.mp4

      • Example: 08342-3.mp4

      • FileID: a unique identifier attached to each file.

      • WHRatio: indicates the video resolution (width and height).

        • WHRatio=3, then (width, height)=(1280,720).

        • WHRatio=4, then (width, height)=(720,1280).

    • Name of npz files: {FileID}.npz

      • Example: 08342-3.npz

      • FileID: a unique identifier attached to each file.

      • See the 'npz file' section below for how to load this file.

3. Test Phase 3

  • Size: 2.70 GB

  • Ratio of the given datasets

    • Train : Val : Test 1 : Test 2 : Test 3 = 8 : 1 : 0.33 : 0.33 : 0.33

  • File descriptions in merc2020-3.tgz

    • This test dataset was recorded by actors who do not appear in the training and validation datasets.

    • A directory containing mp4 videos and npz files of the test 3 dataset

    • Name of mp4 videos: {FileID}-{WHRatio}.mp4

      • Example: 51866-4.mp4

      • FileID: a unique identifier attached to each file.

      • WHRatio: indicates the video resolution (width and height).

        • WHRatio=3, then (width, height)=(1280,720).

        • WHRatio=4, then (width, height)=(720,1280).

    • Name of npz files: {FileID}.npz

      • Example: 08342-3.npz

      • FileID: a unique identifier attached to each file.

      • See the 'npz file' section below for how to load this file.

4. npz file

  • This file type is included in the train/val/test1/test2/test3 directories.

  • Each npz file contains a numpy.ndarray matrix.

  • Example

import numpy as np
npz = np.load('11411.npz')                        # load the npz file for a given FileID
word_level_embedding_vector = npz['word_embed']   # word-level embedding matrix

  • word_level_embedding_vector

    • Word level embedding vector

    • Type: numpy.ndarray

    • Shape: (text_morphs_length, 200)


Sponsors