Multimodal Emotion Recognition Competition 2020 (MERC 2020)
Challenge Overview
1. Background
The recent development of artificial intelligence (AI) has substantially improved performance in speech and image recognition. It is now possible for AI to implement higher cognitive functions such as intention, emotion, personality, trust, situational awareness, and decision making.
As a first step toward these higher cognitive functions, we are interested in the recognition of human emotion based on audio, image, and language. In the future, emotion recognition will become increasingly important as a core technology for intelligent human–computer interaction.
2. Objective
This competition aims at developing state-of-the-art models for emotion recognition from emotion-labelled videos containing facial, speech, and text data. Participants are expected to achieve the highest possible accuracy in recognizing one of 7 classes, i.e., the neutral state and 6 emotions.
All participants will be given multimodal emotional videos. In each video, a Korean actor shows facial and vocal expressions of a targeted emotion while speaking a given script. Each video carries three emotion labels: one for the facial video, one for the audio, and one for the integrated emotion. All these labels are tagged by human evaluators. Participants are expected to submit their multimodal integrated recognition results for the test datasets.
3. Schedule
September 11th, 2020
Release of the baseline code and three datasets, i.e., the train, validation, and first test datasets.
November 6th, 2020
Release of the second test dataset.
November 23rd, 2020
Release of the third test dataset.
Final submission deadline.
November 27th, 2020
Winner Talks
4. Datasets
4.1. How to download datasets
Fill out this agreement.
Send your agreement with your Eval-AI ID and team information to merc.kaist@gmail.com.
When your agreement is confirmed, a private download link will be sent to your email address.
4.2. About the datasets
Each video in the datasets consists of multimodal data: a facial video, speech, and a text sequence.
To ensure fair competition among participants, the words are given as word2vec embedding vectors. Therefore, the datasets do not contain any human-readable text.
5. Baseline model
The following repositories implement a baseline model, which participants are free to use. The organizer offers two repositories: one contains a multimodal emotion recognition model, and the other a speech emotion recognition model. The multimodal model does not contain the speech model but uses its outputs.
Multimodal emotion recognition model (video, speech, text): https://github.com/ki4ai-yhs/merc2020
Speech emotion recognition model: https://github.com/ki4ai-skc/qia2020
To receive awards, a participant's model should outperform the baseline model. The following is the performance of the baseline model on the three test sets.
(Accuracy of the first test set) = 0.376819
(Accuracy of the second test set) = 0.370509
(Accuracy of the third test set) = 0.370231
6. Evaluation
You have to submit your prediction results on the three test datasets (Tests 1, 2, and 3) to the MERC2020 EvalAI page. Each test dataset has its own evaluation period, which is described on the EvalAI page.
6.1. Submission files for evaluation
Submissions are evaluated by classification accuracy across all three test datasets.
Submission files should contain two header columns named 'FileID' and 'Emotion'.
In the 'FileID' column, write the 5-digit file ID strings. In the 'Emotion' column, write the 3-character emotion strings (hap, sad, ang, sur, dis, fea, neu).
See the sample.csv file for details.
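As a minimal sketch, a submission file in this format could be written as follows. The file IDs and emotion labels below are placeholders for illustration, not real test-set entries.

```python
import csv

# Hypothetical predictions mapping file IDs to 3-character emotion codes.
predictions = {"08342": "hap", "51377": "neu", "52215": "sad"}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["FileID", "Emotion"])  # required header row
    for file_id, emotion in sorted(predictions.items()):
        writer.writerow([file_id, emotion])
```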
6.2. Evaluation of final performance
The ranks of participating teams will be based on a weighted average of the accuracies on the three test datasets:
FinalScore = 0.3 * (Accuracy of the first test set)
+ 0.3 * (Accuracy of the second test set)
+ 0.4 * (Accuracy of the third test set)
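The weighted average above can be computed directly; plugging in the baseline accuracies from Section 5 gives the score a submission must beat.

```python
def final_score(acc1, acc2, acc3):
    # Weighted average of the three test-set accuracies; the third (final)
    # test set carries the largest weight.
    return 0.3 * acc1 + 0.3 * acc2 + 0.4 * acc3

# Baseline accuracies reported in Section 5:
baseline_final = final_score(0.376819, 0.370509, 0.370231)
```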
7. Prize
The rank of participating teams will be determined by the order of their final scores.
1st ranked: $2,000
2nd ranked: $1,000
3rd ranked: $500
7.1. Awardee Qualifications
Only teams that meet the following qualifications are eligible for the awards. If no team meets the qualifications, there may be no awardee. Personal information (passport number, account number, etc.) will be requested to confer the awards.
Their models must outperform the baseline model on the final score.
They must submit their source code online so the organizers can verify that the submitted results were generated by that code. (This verification is required only for award candidates.)
8. Awardees
9. Winner Talks
1st Place Winner Talk by Dmitrii Tsybulevskii (Team: u1234x1234)
QnA
Q1. On Slide 11, I was not able to understand "the first 70% of the pretrained speaker recognition models." Particularly, what does "the first 70%" mean?
A1. The neural network has a linear structure (i.e., a sequence of layers: Conv2d, BatchNorm, Linear, ...). During training, the first 70% of the pretrained weights were frozen. That means these weights are NOT updated at all. In PyTorch terms: requires_grad=False.
https://discuss.pytorch.org/t/how-the-pytorch-freeze-network-in-some-layers-only-the-rest-of-the-training/7088/2 Example: suppose you have a network with 10 trainable layers. If you freeze the first 70%, the weights of the first 7 layers will not be updated at all, but the weights of the last 3 layers will be learned as usual. This technique reduces the effective capacity of the model, making it harder to overfit on small datasets.
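The freezing described above can be sketched in PyTorch as follows. The 10-layer model here is only a stand-in for the pretrained speaker-recognition backbone, which is not reproduced.

```python
import torch.nn as nn

# Stand-in backbone: 10 trainable layers in sequence.
model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(10)])

# Freeze the first 70% of the layers: their weights keep the pretrained
# values and receive no gradient updates during training.
layers = list(model.children())
n_frozen = int(0.7 * len(layers))
for layer in layers[:n_frozen]:
    for p in layer.parameters():
        p.requires_grad = False
```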
Q2. On Slide 11, how do you decide to use "70%" rather than other percentages? Is there any literature to infer the number "70"?
A2. I tried 0% (no freezing) and a few other thresholds: 30, 50, and 70. 70% worked best on this particular task (the best validation score); there is no magic in this number. It may not be an optimal solution, but it just works. In general it makes sense to try different thresholds (e.g., 30, 50, 70, ...) and empirically select the one that works best, by doing some analysis (a/b tests). Unfortunately, I'm not aware of any papers with a grounded theoretical justification.
2nd Place Winner Talk by Seongwoong Jo (Team: SeongwoongJo)
QnA
Q1. On Slide 8, you said you used linear warm-up scheduling. I am just curious about the hyperparameters of that scheduling.
A1. The linear warmup scheduler increases the learning rate from 0 to the initial lr over the first epoch. This scheduler stabilizes training with Adam-based optimizers. (There is an instability issue with Adam-based optimizers, and the warmup scheduler is one of the solutions.)
Q2. Could you introduce any literature I can dig into to understand deeply "instability issue on the Adam-based optimizer, and the warmup scheduler is one of the solution"?
A2. https://arxiv.org/pdf/1910.04209.pdf. This is one of the latest papers explaining the instability of adaptive optimizers. The paper discusses several optimizers that address the instability, such as RAdam, but the problem still remains.
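The linear warmup described in A1 can be sketched as a simple schedule function. What happens after warmup is an assumption here (the talk only describes the ramp itself); the function simply holds the base rate.

```python
def warmup_lr(step, warmup_steps, base_lr):
    """Linearly ramp the learning rate from ~0 up to base_lr over the
    first warmup_steps updates, then hold it constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```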
Q3. On Slide 7, you said 'multitask loss.' Does it mean the summed loss over the integrated, face, and speech labels? If so, did you apply different weights to those losses to construct the total training loss?
A3. I used loss = integrate_loss + lambda * (face_loss + speech_loss) / 2 and tried different lambdas: 0.5, 1, and 2. The best lambda was 2, which amounts to a simple summation of the three losses.
Q4. Did you use only those three lambdas in loss?
A4. Yes. I think that if lambda is bigger than 2, face_loss and speech_loss will dominate integrate_loss, so training will not be able to focus on the integrate loss. Also, I didn't have much time, so I picked three representative values: 0.5, 1, and 2.
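The loss combination described in A3 can be sketched with plain floats standing in for the per-head cross-entropy losses:

```python
def multitask_loss(integrate_loss, face_loss, speech_loss, lam=2.0):
    # Total training loss as described in A3; with lam = 2 the auxiliary
    # terms reduce to a plain sum with the integrate loss.
    return integrate_loss + lam * (face_loss + speech_loss) / 2
```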
3rd Place Winner Talk by Huynh Van Thong, Soohyung Kim (Team: PRLAIC)
QnA
Q1. In the training section, what kind of warm-up did you use? Linear?
A1. Yes, we used a linear function to warm up the learning rate.
Q2. In your objective function, you didn't specify what alpha, beta, gamma, and delta are. Would you provide any numbers for them? And could you share the final, saturated alpha, beta, gamma, and delta?
A2. The values of alpha, beta, gamma, and delta are learnable during training. The initial value is 1 for each, which assumes an equal contribution from each loss/modality. The final values for alpha (audio), beta (face), gamma (text), and delta (video) are approximately 0.19, 0.16, 0.22, and 0.36.
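One way to realize such learnable loss weights in PyTorch is as trainable scalars, each initialized to 1. The exact parameterization used by the team is not given in the talk, so this is only a sketch under that assumption.

```python
import torch
import torch.nn as nn

class ModalityWeights(nn.Module):
    """Four learnable scalars weighting the audio/face/text/video losses
    (alpha, beta, gamma, delta), each initialized to 1 so every modality
    starts with equal contribution."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(4))  # alpha, beta, gamma, delta

    def forward(self, losses):
        # losses: tensor holding the four per-modality losses
        return (self.w * losses).sum()
```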
Data Description
1. Test phase 1
Capacity: 70.3 GB
In Test Phase 1, three types of datasets are provided: the train, validation, and test 1 datasets.
Ratio of the given datasets
Train : Val : Test 1 = 8 : 1 : 0.33
It is allowed to merge the validation dataset into the train dataset.
In each dataset, the 7 emotion labels are equally distributed. In other words, the labels of these datasets are balanced.
Files descriptions in merc2020-1.tgz
merc2020-1
├── train.csv
├── train_face.csv
├── train_speech.csv
├── val.csv
├── val_face.csv
├── val_speech.csv
├── test1
├── train
└── val
train.csv
A CSV file containing file IDs and their integrated emotion labels of the training dataset
Two columns: FileID and Emotion
train_face.csv
A CSV file containing file IDs and their face emotion labels of the training dataset
Two columns: FileID and Emotion
train_speech.csv
A CSV file containing file IDs and their speech emotion labels of the training dataset
Two columns: FileID and Emotion
val.csv
A CSV file containing file IDs and their integrated emotion labels of the validation dataset
Two columns: FileID and Emotion
val_face.csv
A CSV file containing file IDs and their face emotion labels of the validation dataset
Two columns: FileID and Emotion
val_speech.csv
A CSV file containing file IDs and their speech emotion labels of the validation dataset
Two columns: FileID and Emotion
test1
A directory containing mp4 videos and npz files of the test 1 dataset
This test dataset was recorded by actors not included in the training and validation datasets.
Name of mp4 videos: {FileID}-{WHRatio}.mp4
Example: 08342-3.mp4
FileID: uniquely attached to files.
WHRatio: This number indicates the video's width-height configuration (orientation).
WHRatio=3, then (width, height)=(1280,720).
WHRatio=4, then (width, height)=(720,1280).
Name of npz files: {FileID}.npz
Example: 08342-3.npz
FileID: uniquely attached to files.
See Section 4 (npz file) below for how to load this file.
train
A directory containing mp4 videos and npz files of the training dataset
Name of mp4 videos: {FileID}-{WHRatio}-{PersonID}-{gender}-{age}-{utteranceID}-{IntegratedEmotion}-{FaceEmotion}-{SpeechEmotion}.mp4
Example: 51377-4-090-w-67-051-neu-dis-neu.mp4
FileID: uniquely attached to files.
WHRatio: This number indicates the video's width-height configuration (orientation).
WHRatio=3, then (width, height)=(1280,720).
WHRatio=4, then (width, height)=(720,1280).
IntegratedEmotion: Integrated emotion
This value can be one of 7 emotion values: 'neu', 'hap', 'ang', 'fea', 'dis', 'sur', 'sad'
FaceEmotion: Face emotion
SpeechEmotion: Speech emotion
Name of npz files: {FileID}.npz
Example: 08342-3.npz
FileID: uniquely attached to files.
See Section 4 (npz file) below for how to load this file.
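The naming scheme above can be unpacked with a small helper; the field names are taken directly from the scheme, and the example filename is the one given above.

```python
def parse_train_name(filename):
    # Split a train/val video filename into its labelled fields,
    # following the naming scheme {FileID}-{WHRatio}-{PersonID}-...
    stem = filename.rsplit(".", 1)[0]
    keys = ["FileID", "WHRatio", "PersonID", "gender", "age", "utteranceID",
            "IntegratedEmotion", "FaceEmotion", "SpeechEmotion"]
    return dict(zip(keys, stem.split("-")))

info = parse_train_name("51377-4-090-w-67-051-neu-dis-neu.mp4")
```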
val
A directory containing mp4 videos and npz files of the validation dataset
Name of mp4 videos: {FileID}-{WHRatio}-{PersonID}-{gender}-{age}-{utteranceID}-{IntegratedEmotion}-{FaceEmotion}-{SpeechEmotion}.mp4
Example: 52215-4-092-m-35-047-neu-neu-sad.mp4
Name of npz files: {FileID}.npz
Example: 08342-3.npz
FileID: uniquely attached to files.
See Section 4 (npz file) below for how to load this file.
2. Test phase 2
Capacity: 2.69 GB
Ratio of the given datasets
Train : Val : Test 1 : Test 2 = 8 : 1 : 0.33 : 0.33
Files descriptions in merc2020-2.tgz
A directory containing mp4 videos and npz files of the test 2 dataset
This test dataset was recorded by actors not included in the training and validation datasets.
Name of mp4 videos: {FileID}-{WHRatio}.mp4
Example: 08342-3.mp4
FileID: uniquely attached to files.
WHRatio: This number indicates the video's width-height configuration (orientation).
WHRatio=3, then (width, height)=(1280,720).
WHRatio=4, then (width, height)=(720,1280).
Name of npz files: {FileID}.npz
Example: 08342-3.npz
FileID: uniquely attached to files.
See Section 4 (npz file) below for how to load this file.
3. Test Phase 3
Capacity: 2.70 GB
Ratio of the given datasets
Train : Val : Test 1 : Test 2 : Test 3 = 8 : 1 : 0.33 : 0.33 : 0.33
Files descriptions in merc2020-3.tgz
A directory containing mp4 videos and npz files of the test 3 dataset
This test dataset was recorded by actors not included in the training and validation datasets.
Name of mp4 videos: {FileID}-{WHRatio}.mp4
Example: 51866-4.mp4
FileID: uniquely attached to files.
WHRatio: This number indicates the video's width-height configuration (orientation).
WHRatio=3, then (width, height)=(1280,720).
WHRatio=4, then (width, height)=(720,1280).
Name of npz files: {FileID}.npz
Example: 08342-3.npz
FileID: uniquely attached to files.
See Section 4 (npz file) below for how to load this file.
4. npz file
This file type is included in the train/val/test1/test2/test3 directories.
Each npz file contains a numpy.ndarray matrix.
Example
import numpy as np
npz = np.load('11411.npz')
word_level_embedding_vector = npz['word_embed']
word_level_embedding_vector
Word level embedding vector
Type: numpy.ndarray
Shape: (text_morphs_length, 200)
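A fuller sketch of working with these files follows. Since the real dataset files cannot be bundled here, the snippet first writes a dummy npz in the same layout (the length 17 is an arbitrary stand-in for text_morphs_length, which varies per clip), and the mean-pooling step is just one illustrative way to obtain a fixed-size text feature.

```python
import numpy as np

# Create a dummy npz file with the same layout for illustration; the real
# files ship with the dataset directories.
np.savez("dummy.npz", word_embed=np.zeros((17, 200), dtype=np.float32))

npz = np.load("dummy.npz")
word_embed = npz["word_embed"]           # shape: (text_morphs_length, 200)
sentence_vec = word_embed.mean(axis=0)   # e.g. mean-pool to a fixed size
```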