Seventh Emotion Recognition in the Wild Challenge (EmotiW)

ACM International Conference on Multimodal Interaction 2019

1. Engagement Prediction in the Wild

The task here is to predict the engagement intensity of a subject in a video. During the recording session, the subject watched an educational video (MOOC). The average duration of a video is 5 minutes. The intensity of engagement ranges from not engaged (distracted) to highly engaged. The data has been recorded in diverse conditions and across different environments.

Baseline [1]: The head pose and eye gaze features are extracted using the OpenFace 0.23 library. Each video is divided into segments. A segment is represented by computing the standard deviation of the head movement directions across the frames in that segment. The eye gaze movement is represented as the average variance of the left and right eye gaze points in a segment relative to the mean gaze points of the video. The eye gaze and head pose features are concatenated, giving a 9 dimensional feature vector per segment, so each video is represented as a sequence of segment-level fused features carrying head pose and eye gaze information. These features are passed through an LSTM layer, which returns an activation for each segment of the video; the activations are flattened and passed through a network of three dense layers followed by average pooling, which gives the regressed engagement level of the video. The intensity labels (as available on Google Drive and OneDrive) have been quantized in the range [0-1] as 0, 0.33, 0.66 and 1.
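As an illustration, the segment descriptor could be computed along the following lines. This is a minimal sketch assuming the OpenFace per-frame CSV is already available; the column names (pose_Rx/Ry/Rz, gaze_0_*, gaze_1_*) and the segment length are assumptions for illustration, not the exact baseline settings.

```python
# Sketch of the 9-dimensional segment descriptor (head pose std + eye gaze variance).
# Column names and segment length are assumptions, not confirmed baseline settings.
import numpy as np
import pandas as pd

def segment_features(openface_csv, frames_per_segment=150):
    df = pd.read_csv(openface_csv)

    # Per-frame head pose angles and left/right eye gaze vectors (assumed columns).
    head = df[["pose_Rx", "pose_Ry", "pose_Rz"]].to_numpy()
    gaze_l = df[["gaze_0_x", "gaze_0_y", "gaze_0_z"]].to_numpy()
    gaze_r = df[["gaze_1_x", "gaze_1_y", "gaze_1_z"]].to_numpy()

    # Video-level mean gaze points, the reference for per-segment variance.
    mean_l, mean_r = gaze_l.mean(axis=0), gaze_r.mean(axis=0)

    feats = []
    for start in range(0, len(df), frames_per_segment):
        end = start + frames_per_segment
        h, gl, gr = head[start:end], gaze_l[start:end], gaze_r[start:end]

        # 3-D: standard deviation of head movement directions within the segment.
        head_std = h.std(axis=0)
        # 3-D + 3-D: mean squared deviation of each eye's gaze from the video mean.
        gaze_var_l = ((gl - mean_l) ** 2).mean(axis=0)
        gaze_var_r = ((gr - mean_r) ** 2).mean(axis=0)

        # Fused 9-dimensional segment descriptor.
        feats.append(np.concatenate([head_std, gaze_var_l, gaze_var_r]))

    return np.stack(feats)  # shape: (num_segments, 9)
```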

To clarify, the LSTM was trained on the quantized values. Note that this is a regression problem. The MSE on the validation set is 0.10.
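A Keras sketch of such a regressor is given below. The layer sizes, the use of TimeDistributed dense layers and the placement of the average pooling follow one plausible reading of the description above and are assumptions, not the exact baseline configuration.

```python
# Minimal Keras sketch of an LSTM-based engagement regressor over segment features.
# Layer sizes and the dense/average-pooling ordering are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_engagement_model(num_segments, feat_dim=9):
    inp = layers.Input(shape=(num_segments, feat_dim))

    # LSTM returns one activation vector per segment.
    x = layers.LSTM(64, return_sequences=True)(inp)

    # Three dense layers applied to each segment activation.
    x = layers.TimeDistributed(layers.Dense(32, activation="relu"))(x)
    x = layers.TimeDistributed(layers.Dense(16, activation="relu"))(x)
    x = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)

    # Average pooling over segments gives the video-level engagement value in [0, 1].
    out = layers.GlobalAveragePooling1D()(x)

    model = models.Model(inp, out)
    # Regression against the quantized labels {0, 0.33, 0.66, 1} with MSE loss.
    model.compile(optimizer="adam", loss="mse")
    return model
```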

2. Audio-video Emotion Recognition

The video based emotion recognition challenge contains short audio-video clips labelled using a semi-automatic approach described in [2]. This challenge is a continuation of the challenges run in EmotiW 2013-16. The task is to assign a single emotion label to each video clip from the six universal emotions (Anger, Disgust, Fear, Happiness, Sadness & Surprise) and Neutral. Classification accuracy is the comparison metric. The baseline for the sub-challenge is based on computing the LBP-TOP descriptor; an SVR is trained on these features and the validation accuracy is 38.81%.
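For reference, a simplified LBP-TOP computation can be sketched as below for a grayscale face-crop video volume, using uniform LBP histograms from the three orthogonal planes (XY, XT, YT). The parameters and the slice sampling are illustrative assumptions, not the baseline's exact settings.

```python
# Simplified LBP-TOP descriptor for a grayscale video volume of shape (T, H, W).
# Parameters (P, R) and averaging over all slices are illustrative assumptions.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_hist(plane, P=8, R=1):
    # Uniform LBP histogram of a single 2-D slice, normalised to sum to 1.
    codes = local_binary_pattern(plane, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def lbp_top(volume, P=8, R=1):
    T, H, W = volume.shape
    # XY plane: spatial texture, one histogram per frame, averaged over time.
    xy = np.mean([lbp_hist(volume[t], P, R) for t in range(T)], axis=0)
    # XT plane: horizontal motion texture, one histogram per row, averaged.
    xt = np.mean([lbp_hist(volume[:, y, :], P, R) for y in range(H)], axis=0)
    # YT plane: vertical motion texture, one histogram per column, averaged.
    yt = np.mean([lbp_hist(volume[:, :, x], P, R) for x in range(W)], axis=0)
    return np.concatenate([xy, xt, yt])  # 3 * (P + 2) dimensional descriptor
```

The resulting per-clip descriptors could then be fed to a standard kernel machine, e.g. sklearn.svm.SVC or an SVR-style model as in the description above; the organisers' exact training setup is not reproduced here.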

3. Group-based Cohesion Prediction

The cohesiveness of a group is an essential indicator of the emotional state, structure and success of a group of people. The task is to predict the perceived cohesiveness of a group of people in an image. The paper [3] describing the task is available at [LINK].


1. A. Kaur, A. Mustafa, L. Mehta and A. Dhall, Prediction and Localization of Student Engagement in the Wild, Digital Image Computing: Techniques and Applications (DICTA) 2018.
2. A. Dhall, R. Goecke, S. Lucey and T. Gedeon, Collecting Large, Richly Annotated Facial-Expression Databases from Movies, IEEE Multimedia 2012.
3. S. Ghosh, A. Dhall, N. Sebe and T. Gedeon, Predicting Cohesiveness in Images, International Joint Conference on Neural Networks (IJCNN) 2019.