6th Emotion Recognition in the Wild Challenge (EmotiW)

ACM International Conference on Multimodal Interaction 2018, Colorado, USA.

a. Group-level emotion recognition

The task here is to classify a group's perceived emotion as Positive, Neutral or Negative. Social network users upload large numbers of images captured during social events to the internet. These images can be from positive social events such as convocations, marriages or parties, neutral events such as meetings, or negative events such as funerals and protests. The images in this sub-challenge are from the Group Affect Database 2.0 [1]. The data is distributed into three sets: Train, Validation and Test. The Train and Validation sets will be shared in March along with the ground-truth labels. The Test set labels will be withheld, and evaluation will be performed on the test labels generated and shared by the participants. During EmotiW 2016, the group-level task was inferring the happiness intensity of a group of people in images. This year, the challenge has advanced to consider images from a broader valence range.

For details on group-level emotion recognition, a talk by Prof. Goecke is available on VideoLectures.NET at: http://videolectures.net/fgconference2015_goecke_people_images/

Baseline

We trained an Inception V3 network for the three-class task. On the Validation set, the overall classification accuracy is 0.65. The class-wise accuracies are 0.72 (Positive), 0.60 (Neutral) and 0.60 (Negative).

For the Test set, the overall classification accuracy is 0.61. The class-wise accuracies are 0.75 (Positive), 0.50 (Neutral) and 0.53 (Negative).
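A minimal sketch of how such a baseline could be set up in Keras/TensorFlow is given below. This is an illustration only, not the organisers' code: the data directory layout (one folder per class), input size, and training hyper-parameters are assumptions.

# Hedged sketch: fine-tuning Inception V3 for the three group-emotion classes.
# Data layout, input size and hyper-parameters are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

train_ds = tf.keras.utils.image_dataset_from_directory(
    "group_affect/train",                       # hypothetical path: Positive/, Neutral/, Negative/
    image_size=(299, 299), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "group_affect/val", image_size=(299, 299), batch_size=32)

base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                          # warm-up: train only the new classifier head

inputs = layers.Input(shape=(299, 299, 3))
x = layers.Rescaling(1.0 / 127.5, offset=-1)(inputs)    # Inception V3 expects inputs in [-1, 1]
x = base(x, training=False)
outputs = layers.Dense(3, activation="softmax")(x)      # Positive / Neutral / Negative
model = models.Model(inputs, outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=5)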

Sample images from this sub-challenge and the corresponding labels [1].

b. Engagement in the Wild

The task here is to predict the engagement intensity of a subject in a video. During the recording session, the subject watched an educational (MOOC) video. The average duration of a video is 5 minutes. The engagement intensity ranges from not engaged (distracted) to highly engaged. The data has been recorded in diverse conditions and across different environments.

Baseline: The head pose and eye gaze features have been uploaded to the OneDrive link. The features were extracted using the OpenFace 0.23 library. Each video is divided into segments. A segment is represented by the standard deviation of the head movement directions across the frames in that segment. Eye gaze movement is represented as the average variance of the gaze points returned for the left and right eyes in a segment, relative to the mean eye points of the whole video. The eye gaze and head pose features are concatenated into a 9-dimensional feature vector, so each video is represented as a sequence of segments, each described by this fused head pose and eye gaze feature. The segment features are passed through an LSTM layer, which returns an activation for each segment of the video; these activations are flattened and fed to a network of three dense layers followed by average pooling, which outputs the regressed engagement level of the video. The intensity labels (as available on Google Drive and OneDrive) have been quantised in the range [0, 1] as 0, 0.33, 0.66 and 1 for the four intensity levels.
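As a rough illustration, the segment-level feature computation could be reproduced along the following lines. This is a sketch only: the OpenFace-style column names, the segment length, and the 6+3 split of the 9 dimensions are assumptions rather than the organisers' exact recipe.

# Hedged sketch of the per-segment head-pose / eye-gaze features described above.
# Column names, segment length and the 6+3 feature split are assumptions.
import numpy as np
import pandas as pd

POSE_COLS = ["pose_Rx", "pose_Ry", "pose_Rz", "pose_Tx", "pose_Ty", "pose_Tz"]
GAZE_COLS = ["gaze_0_x", "gaze_0_y", "gaze_0_z", "gaze_1_x", "gaze_1_y", "gaze_1_z"]
SEGMENT_LEN = 150  # assumed number of frames per segment

def video_segment_features(csv_path):
    """Return an array of 9-d features, one row per segment of the video."""
    df = pd.read_csv(csv_path)
    gaze_mean = df[GAZE_COLS].mean().to_numpy()           # mean eye points over the whole video
    feats = []
    for start in range(0, len(df), SEGMENT_LEN):
        seg = df.iloc[start:start + SEGMENT_LEN]
        pose_std = seg[POSE_COLS].std().to_numpy()         # 6-d: std-dev of head movement
        # 3-d: variance of left/right gaze w.r.t. the video mean, averaged over both eyes
        gaze_dev = ((seg[GAZE_COLS] - gaze_mean) ** 2).mean().to_numpy()
        gaze_var = 0.5 * (gaze_dev[:3] + gaze_dev[3:])
        feats.append(np.concatenate([pose_std, gaze_var]))
    return np.stack(feats)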

To clarify, the LSTM was trained on the quantised values. Note that this is a regression problem. The MSE on the Validation set is 0.10.
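One possible reading of the regression pipeline described above is sketched below in Keras. It is not the organisers' code: the layer sizes, the number of segments per video, and the exact placement of the average pooling step are assumptions made to fill in details the description leaves open.

# Hedged sketch of the LSTM engagement-regression baseline (illustrative only).
import numpy as np
from tensorflow.keras import layers, models

NUM_SEGMENTS = 20   # assumed fixed number of segments per video
FEATURE_DIM = 9     # fused head-pose + eye-gaze feature per segment

inputs = layers.Input(shape=(NUM_SEGMENTS, FEATURE_DIM))
x = layers.LSTM(32, return_sequences=True)(inputs)   # one activation per segment
x = layers.Flatten()(x)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(32, activation="relu")(x)
x = layers.Dense(NUM_SEGMENTS)(x)                    # one score per segment
x = layers.Reshape((NUM_SEGMENTS, 1))(x)
outputs = layers.GlobalAveragePooling1D()(x)         # average over segments -> video-level value
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")          # regression on the quantised labels

# Dummy usage with quantised engagement labels in {0, 0.33, 0.66, 1}
X = np.random.rand(4, NUM_SEGMENTS, FEATURE_DIM).astype("float32")
y = np.array([[0.0], [0.33], [0.66], [1.0]], dtype="float32")
model.fit(X, y, epochs=1, batch_size=2)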


c. Audio-video Sub-challenge

The video-based emotion recognition challenge contains short audio-video clips labelled using a semi-automatic approach defined in [2]. This challenge is a continuation of the challenges in EmotiW 2013-16. The task is to assign a single emotion label to each video clip from the six universal emotions (Anger, Disgust, Fear, Happiness, Sadness and Surprise) plus Neutral. Classification accuracy is the comparison metric. The baseline for the sub-challenge is based on computing the LBP-TOP descriptor. An SVR is trained on these features, and the Validation accuracy is 38.81%.
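A hedged sketch of how such a baseline could be reproduced on precomputed LBP-TOP descriptors is given below. The feature files, feature dimensionality, one-vs-rest use of the SVR, and hyper-parameters are all assumptions; computing LBP-TOP itself is outside the scope of this snippet.

# Hedged sketch: 7-class emotion classification from precomputed LBP-TOP
# descriptors using one SVR per class (one-vs-rest). File names and
# hyper-parameters are hypothetical and for illustration only.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import accuracy_score

EMOTIONS = ["Anger", "Disgust", "Fear", "Happiness", "Neutral", "Sadness", "Surprise"]

X_train = np.load("lbptop_train.npy")   # shape: (n_train_clips, feature_dim)
y_train = np.load("labels_train.npy")   # integer labels 0..6
X_val = np.load("lbptop_val.npy")
y_val = np.load("labels_val.npy")

# Train one regressor per emotion; the predicted class is the argmax of the scores.
regressors = []
for cls in range(len(EMOTIONS)):
    svr = SVR(kernel="rbf", C=1.0)
    svr.fit(X_train, (y_train == cls).astype(float))
    regressors.append(svr)

scores = np.column_stack([r.predict(X_val) for r in regressors])
pred = scores.argmax(axis=1)
print("Validation accuracy:", accuracy_score(y_val, pred))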

References 
1. A. Dhall, J. Joshi, K. Sikka, R. Goecke and N. Sebe, The More the Merrier: Analysing the Affect of a Group of People in Images, IEEE FG 2015. [PDF]
2. A. Dhall, R. Goecke, S. Ghosh, J. Joshi, J. Hoey and T. Gedeon, From Individual to Group-level Emotion Recognition: EmotiW 5.0, ACM ICMI 2017. [PDF]
3. A. Dhall, R. Goecke, S. Lucey and T. Gedeon, Collecting Large, Richly Annotated Facial-Expression Databases from Movies, IEEE MultiMedia 2012. [PDF]