Eighth Emotion Recognition in the Wild Challenge (EmotiW)

ACM International Conference on Multimodal Interaction 2020

The EmotiW 2020 challenge is now over. However, the datasets mentioned below are available for academic research. Please write to emotiw2014@gmail.com with the database name in the subject line.

1. Engagement Prediction in the Wild

The task here is to predict the engagement intensity of a subject in a video. During the recording session, the subject watched an educational video (MOOC). The average duration of a video is 5 minutes. The intensity of engagement ranges from not engaged (distracted) to highly engaged. The data has been recorded in diverse conditions and across different environments.

Baseline [1]: The head pose and eye gaze features are extracted using the OpenFace 0.23 library. Each video is divided into segments. A segment is represented by the standard deviation of the head movement directions across the frames in that segment. The eye gaze movement is represented as the average variance of the gaze points of the left and right eyes in that segment with respect to the mean eye points of the video. The eye gaze and head pose features are concatenated, resulting in a 9-dimensional feature vector per segment, so each video is represented as a sequence of segment-level fused features. These features are passed through an LSTM layer, which returns an activation for each segment of the video; the activations are flattened and passed through three dense layers followed by average pooling, which gives the regressed engagement value for the video. The intensities (as available on Google Drive and OneDrive) have been quantised in the range [0, 1] as 0, 0.33, 0.66 and 1 for the four intensity levels.

To clarify, the LSTM was trained on the quantised values. Note that this is a regression problem. The MSE on the validation set is 0.10.
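
For illustration, a minimal sketch of such a segment-level LSTM regressor is given below in PyTorch. This is not the official baseline code; the number of segments per video, the hidden size and the layer widths are assumptions made for the example.

import torch
import torch.nn as nn

NUM_SEGMENTS = 50   # assumption: number of segments per video after temporal splitting
FEATURE_DIM = 9     # concatenated head-pose std-dev and eye-gaze variance features

class EngagementRegressor(nn.Module):
    def __init__(self, hidden_dim=32):
        super().__init__()
        # LSTM returns an activation for every segment of the video
        self.lstm = nn.LSTM(FEATURE_DIM, hidden_dim, batch_first=True)
        # Flattened segment activations go through three dense layers
        self.dense = nn.Sequential(
            nn.Linear(NUM_SEGMENTS * hidden_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, NUM_SEGMENTS),  # one score per segment
        )

    def forward(self, x):                   # x: (batch, NUM_SEGMENTS, FEATURE_DIM)
        out, _ = self.lstm(x)               # (batch, NUM_SEGMENTS, hidden_dim)
        out = out.flatten(start_dim=1)      # flatten the segment activations
        out = self.dense(out)               # (batch, NUM_SEGMENTS)
        return out.mean(dim=1)              # average pooling gives the regressed value

model = EngagementRegressor()
criterion = nn.MSELoss()                    # regression against quantised labels {0, 0.33, 0.66, 1}
features = torch.randn(4, NUM_SEGMENTS, FEATURE_DIM)
labels = torch.tensor([0.0, 0.33, 0.66, 1.0])
loss = criterion(model(features), labels)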

2. Audio-video Group Emotion Recognition

The audio-video group emotion recognition challenge contains group videos downloaded from YouTube under the Creative Commons license. The data has a lot of variation in terms of context, number of people, video quality, etc. The task is to classify each video into three classes: positive, neutral and negative. For the baseline, a model trained on an image-based group emotion database is used to extract the visual features. For the audio features, the ComParE challenge feature set is used. The fusion of the visual and audio features achieves 50.05% accuracy on the validation set [5]. The evaluation metric is classification accuracy. More details can be found in the baseline paper.
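
A minimal feature-fusion sketch in PyTorch is shown below. It is not the baseline implementation from [5]; the feature dimensions (a 512-D visual embedding and ComParE-style audio functionals) and the projection sizes are illustrative assumptions, and the class ordering is arbitrary.

import torch
import torch.nn as nn

VISUAL_DIM, AUDIO_DIM, NUM_CLASSES = 512, 6373, 3   # positive / neutral / negative

class FusionClassifier(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Project each modality, then classify on the concatenation
        self.visual_proj = nn.Sequential(nn.Linear(VISUAL_DIM, hidden), nn.ReLU())
        self.audio_proj = nn.Sequential(nn.Linear(AUDIO_DIM, hidden), nn.ReLU())
        self.classifier = nn.Linear(2 * hidden, NUM_CLASSES)

    def forward(self, visual, audio):
        fused = torch.cat([self.visual_proj(visual), self.audio_proj(audio)], dim=1)
        return self.classifier(fused)        # class logits; accuracy is the evaluation metric

model = FusionClassifier()
logits = model(torch.randn(2, VISUAL_DIM), torch.randn(2, AUDIO_DIM))
pred = logits.argmax(dim=1)                  # predicted class index per video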

3. Driver Gaze Prediction (data website - https://sites.google.com/view/drivergazeprediction/home)

The task here is to predict the zone of the car that the driver is looking at. The challenge is based on the Driver Gaze in the Wild (DGW) dataset, which contains over 330 subjects and has been recorded in different illumination conditions. Figure 1 (below) shows sample images. The task is to classify each sample into one of nine zones, which correspond to zones marked inside a car cabin. The data has been collected in a Hyundai car with different subjects at the driver's position. We pasted number stickers representing the different zones in the car; the nine zones cover the rear-view mirror, side mirrors, radio, speedometer and windshield. The recording sensor is a Microsoft LifeCam RGB camera, which also contains a microphone. The baseline is computed by training an Inception V1 network; the classification accuracy on the validation set is 56.0%. The baseline paper for this sub-challenge is available at [4].
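
As a rough illustration of such a baseline, the sketch below fine-tunes torchvision's GoogLeNet (Inception V1) for the nine gaze zones. It is not the implementation from [4]; the use of ImageNet-pretrained weights, the 224x224 input size and the optimiser settings are assumptions for the example.

import torch
import torch.nn as nn
from torchvision import models

NUM_ZONES = 9

# Load Inception V1 (GoogLeNet) and replace the 1000-way ImageNet head
model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_ZONES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# One illustrative training step on a dummy batch of face crops
images = torch.randn(8, 3, 224, 224)
zones = torch.randint(0, NUM_ZONES, (8,))
loss = criterion(model(images), zones)
loss.backward()
optimizer.step()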

4. Physiological Signal based Emotion Recognition

This physiological signal based emotion recognition challenge contains physiological signals (such as electrodermal activity, EDA) collected while observers watched short movie clips. These movie clips are from the Acted Facial Expressions in the Wild (AFEW) dataset [2]. The labels of the physiological signals are the same as those of the corresponding clips. The task is to predict, for each physiological signal series, one of 7 emotion labels: Anger, Disgust, Fear, Happy, Neutral, Sad and Surprise. Classification accuracy is used as the evaluation metric. The baseline for this sub-challenge is based on a three-layer fully connected neural network and achieves an accuracy of 42.1%. Details are provided in the baseline report [3].
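
A minimal sketch of a three-layer fully connected classifier for this task is given below. It is not the baseline code from [3]; the input dimension of the windowed signal features and the hidden-layer widths are assumptions.

import torch
import torch.nn as nn

INPUT_DIM = 128     # assumption: length of the resampled/windowed physiological features
NUM_EMOTIONS = 7    # Anger, Disgust, Fear, Happy, Neutral, Sad, Surprise

model = nn.Sequential(
    nn.Linear(INPUT_DIM, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, NUM_EMOTIONS),    # class logits
)

criterion = nn.CrossEntropyLoss()
signals = torch.randn(16, INPUT_DIM)            # dummy batch of signal features
labels = torch.randint(0, NUM_EMOTIONS, (16,))
loss = criterion(model(signals), labels)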


Figure 1: Driver gaze zone prediction sub-challenge.

1. A. Kaur, A. Mustafa, L. Mehta and A. Dhall, Prediction and Localization of Student Engagement in the Wild, Digital Image Computing: Techniques and Applications (DICTA), 2018.
2. A. Dhall, R. Goecke, S. Lucey and T. Gedeon, Collecting Large, Richly Annotated Facial-Expression Databases from Movies, IEEE MultiMedia, 2012.
3. Y. Liu, T. Gedeon, S. Caldwell, S. Lin and Z. Lin, Emotion Recognition Through Observer's Physiological Signals, arXiv preprint arXiv:2002.08034. https://arxiv.org/pdf/2002.08034.pdf
4. S. Ghosh, A. Dhall, G. Sharma, S. Gupta and N. Sebe, Speak2Label: Using Domain Knowledge for Creating a Large Scale Driver Gaze Zone Estimation Dataset, arXiv preprint arXiv:2004.05973. https://arxiv.org/abs/2004.05973
5. G. Sharma, A. Dhall and J. Cai, Audio-visual Automatic Group Affect Analysis, IEEE Transactions on Affective Computing, 2021.