The challenge uses a subset of the DIEM-A dataset, featuring 92 professional performers (49 Japanese, 43 Taiwanese). Unlike traditional datasets that rely on fixed scenes or scripted actions, performers enacted 12 distinct emotions based on rich, self-created scenarios, yielding highly naturalistic variability in both physical expression and context. Each performer created three distinct scenarios per emotion and performed each scenario at three intensity levels. The 12 target classes comprise 7 basic emotions (Joy, Sadness, Anger, Surprise, Fear, Disgust, Contempt) and 5 social emotions (Gratitude, Guilt, Jealousy, Shame, Pride). The dataset is partitioned as follows:
Data from 74 performers (40 Japanese, 34 Taiwanese), including:
2,664 Scenario Descriptions: The contextual scripts in their original Japanese/Chinese, alongside English translations.
7,992 Motion Sequences: Provided in .bvh, .fbx, and .c3d formats.
Data from the remaining 18 performers (9 Japanese, 9 Taiwanese), including:
1,944 Motion Sequences: Provided in .bvh, .fbx, and .c3d formats. Teams will generate predictions for this hidden set to compete on the leaderboard.
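As a concrete starting point for working with the motion files, the raw frame data in a .bvh file can be read without a full skeleton parser. The sketch below is a minimal illustration only: it skips the HIERARCHY block entirely (a dedicated library such as PyMO is needed to map channels to joints) and assumes a well-formed file.

```python
import numpy as np

def load_bvh_frames(path):
    """Read only the MOTION block of a .bvh file into a (frames, channels) array.

    Minimal sketch: ignores the HIERARCHY skeleton definition, so channel
    columns are not mapped to named joints. Assumes the literal lines
    "MOTION", "Frames: N", and "Frame Time: t" as in the BVH format.
    """
    with open(path) as f:
        lines = f.read().splitlines()
    start = lines.index("MOTION")
    n_frames = int(lines[start + 1].split()[-1])      # "Frames: N"
    frame_time = float(lines[start + 2].split()[-1])  # "Frame Time: t"
    data = np.array(
        [[float(v) for v in ln.split()]
         for ln in lines[start + 3 : start + 3 + n_frames]]
    )
    return data, frame_time
```

The returned array can then be reshaped per joint once the channel layout is known from the hierarchy.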
In preliminary testing, we trained a classic baseline model (ST-GCN [1] with the PYSKL [2] improvements) on the motion capture data alone. Evaluation used leave-performer-out 10-fold cross-validation, yielding the following results:
Accuracy: 27.1% (SD 3.7%)
F1: 25.2% (SD 4.5%)
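The leave-performer-out protocol can be reproduced with scikit-learn's GroupKFold, grouping samples by performer ID so that no performer appears in both the training and test folds. This is an illustrative sketch, not the baseline implementation; `fit_predict` is a hypothetical callable standing in for any classifier (e.g. an ST-GCN training run).

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import accuracy_score, f1_score

def leave_performer_out_cv(features, labels, performer_ids, fit_predict, n_splits=10):
    """Leave-performer-out k-fold CV: folds are split by performer, so no
    performer contributes samples to both train and test.

    fit_predict(X_tr, y_tr, X_te) -> predictions for X_te (hypothetical
    stand-in for your model's train + inference step).
    """
    accs, f1s = [], []
    splitter = GroupKFold(n_splits=n_splits)
    for tr, te in splitter.split(features, labels, groups=performer_ids):
        preds = fit_predict(features[tr], labels[tr], features[te])
        accs.append(accuracy_score(labels[te], preds))
        f1s.append(f1_score(labels[te], preds, average="macro"))
    # Report mean and standard deviation across folds, as in the baseline table.
    return np.mean(accs), np.std(accs), np.mean(f1s), np.std(f1s)
```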
For more detail on the implementation, see the code on GitHub.
Participants must build a model that predicts which of the 12 basic and social emotion categories the performer intended to portray. Teams will submit their test-set predictions together with their system description paper by the deadline. The organizing committee will rank submissions by Macro-F1 and Accuracy.
Understanding how these models work is a core scientific goal of this workshop. We strongly encourage teams to include an explainability analysis in their system description paper, showing how the model derives emotion from body movement. Recommended approaches include, but are not limited to:
Analyze which kinematic features the model prioritizes.
Identify specific body parts crucial for classification.
Visualize temporal or spatial attention maps.
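As one concrete illustration of the second approach, a simple occlusion analysis zeroes out one joint at a time and measures the resulting accuracy drop. The `model_predict` callable and the `(samples, frames, joints, 3)` input layout below are illustrative assumptions, not part of the challenge tooling.

```python
import numpy as np

def joint_occlusion_importance(model_predict, X, y, n_joints):
    """Estimate each joint's importance by masking it and measuring the
    accuracy drop relative to the unmasked baseline.

    model_predict(X) -> class predictions (hypothetical stand-in for any
    trained classifier). X has shape (samples, frames, joints, 3).
    """
    base_acc = np.mean(model_predict(X) == y)
    drops = np.zeros(n_joints)
    for j in range(n_joints):
        X_occ = X.copy()
        X_occ[:, :, j, :] = 0.0  # zero out one joint's full trajectory
        drops[j] = base_acc - np.mean(model_predict(X_occ) == y)
    return drops  # larger drop => more important joint
```

Plotting the drops per joint (or per body part group) gives a direct, model-agnostic view of which parts of the body drive the classification.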
The approach is fully open! Participants can decide which signals to use and which methods to apply:
Methodology: Whether you use classic feature-engineering techniques or deep learning, the goal is to explore innovative ways to decode cross-cultural emotional expressions.
Modality: Teams are welcome to train their models on the motion data alone, or to fuse the motion data with the provided text scenarios (context).
External Resources: Teams may use external datasets or pre-trained models, provided these resources are publicly available and explicitly documented in the submitted paper.
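As a minimal illustration of combining the two modalities, a late-fusion sketch averages the class probabilities of a motion-only and a text-only classifier. The function name, the weighted-average scheme, and the equal-weight default are illustrative assumptions, not a prescribed method.

```python
import numpy as np

def late_fusion_predict(motion_logits, text_logits, alpha=0.5):
    """Fuse a motion-only and a scenario-text classifier by weighted
    averaging of their softmax probabilities (simple late fusion).

    Both logit arrays have shape (samples, 12) for the 12 emotion classes;
    alpha weights the motion branch (illustrative default of 0.5).
    """
    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)  # stabilize before exponentiating
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    probs = alpha * softmax(motion_logits) + (1 - alpha) * softmax(text_logits)
    return probs.argmax(axis=1)
```

Earlier fusion (e.g. concatenating learned motion and text embeddings before a shared classifier head) is an equally valid design; the choice is left entirely to participants.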
Final results will be presented at the MMAC Workshop, held in conjunction with ACII 2026. We will present two distinct awards:
Best Performance Award: Awarded to the team that achieves the highest Macro-F1 and Accuracy scores on the hidden test set.
Best Explainability Award: Awarded to the paper that most effectively explains its model's emotion inference process. The winner will be determined by a combined scoring mechanism: a pre-workshop evaluation of the paper's scientific rigor by the Program Committee, and a live audience vote assessing the clarity of the on-site presentation.
For data access and submission guidelines, please visit the Participation & Submission page.