The field of multimedia computing has seen extensive research on the automatic recognition of human emotions, encompassing diverse modalities such as image, audio, text, and video analysis. This line of research plays a pivotal role in advancing our understanding of human behavior. Over the past few decades, significant progress has been made in individual-level emotion recognition. Notably, social science studies have demonstrated that humans tend to adjust their reactions and behaviors based on their perception of others’ emotions. In response to this insight, research interest in group-level emotion recognition (GER) has grown substantially in recent years. Unlike individual-level emotion recognition, which centers on detecting emotions of single individuals, GER focuses on collectively identifying emotions expressed by groups of people. Beyond its theoretical value, GER also carries profound societal significance across multiple domains—including social behavior analysis, public security, and human-robot interaction. However, analyzing emotions at the group level presents greater challenges than at the individual level, primarily due to the complexity and dynamic nature of unrestricted GER scenarios in open environments.
1) Adaptive Key Role Guided Hierarchical Relation Inference for Enhanced Group-level Emotion Recognition
Our latest research paper, titled “Adaptive Key Role Guided Hierarchical Relation Inference for Enhanced Group-level Emotion Recognition,” has been officially published in IEEE Transactions on Affective Computing—a top-tier journal in the field of affective computing and human-computer interaction. This work presents a novel framework to address long-standing challenges in group-level emotion recognition (GER), pushing the boundaries of how we model collective emotional states in complex social scenarios.
Motivation: Emotion recognition has long been a focus of multimedia computing, but most prior research centers on individual-level emotion analysis. In real-world settings—from public security monitoring and social behavior studies to human-robot interaction—understanding group-level emotions (e.g., identifying whether a crowd at a protest is tense or a gathering at a festival is joyful) is far more impactful. However, GER is inherently challenging: it requires more than aggregating individual emotions—it demands capturing subtle interactions between people, balancing global scene context, and filtering out irrelevant or conflicting emotional cues (e.g., a smiling bystander in a somber crowd).
Method: To address these gaps, we propose Key Role Guided Hierarchical Relation Inference (KR-HRI)—a hierarchical relational network that adaptively identifies “key individuals” (those most representative of the group’s emotion) and models interactions at multiple granularities. Here’s how it works:
1. Multi-Branch Feature Extraction
We extract features from three complementary branches to capture rich context:
Face Branch: Extracts facial expression features from detected faces using ResNet18 (pre-trained on MS-Celeb-1M).
Object Branch: Detects scene objects (e.g., balloons, signs) with Faster R-CNN (trained on MSCOCO) and encodes them with VGG19.
Scene Branch: Captures global setting information using VGG19 (pre-trained on ImageNet).
We also introduce Spatial Positional Encoding (SPE) to embed location information (e.g., where a person stands in a crowd) into features—critical for understanding social interactions.
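As a rough illustration, the sketch below encodes the normalized center of each person's bounding box with a sinusoidal scheme and adds it to the per-individual visual features. The class name, the use of box centers, and the sinusoidal form are assumptions made for illustration; the paper's exact SPE formulation may differ.

```python
import torch
import torch.nn as nn

class SpatialPositionalEncoding(nn.Module):
    """Sinusoidal encoding of normalized (x, y) box centers (illustrative sketch)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        assert dim % 4 == 0, "dim must be divisible by 4 (sin/cos for x and y)"
        self.dim = dim

    def forward(self, boxes: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # boxes: (N, 4) as (x1, y1, x2, y2), normalized to [0, 1]
        # feats: (N, dim) per-individual visual features
        cx = (boxes[:, 0] + boxes[:, 2]) / 2                 # (N,) horizontal center
        cy = (boxes[:, 1] + boxes[:, 3]) / 2                 # (N,) vertical center
        d = self.dim // 4
        freq = 10000 ** (torch.arange(d, device=boxes.device) / d)
        px = cx[:, None] / freq                              # (N, d)
        py = cy[:, None] / freq                              # (N, d)
        pe = torch.cat([px.sin(), px.cos(), py.sin(), py.cos()], dim=-1)  # (N, dim)
        return feats + pe                                    # inject location into features
```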
2. Hierarchical Relation Inference Module (HRIM)
HRIM models interactions in a “coarse-to-fine” manner to isolate key individuals:
Global Relation Module (GRM, Coarse-Grained): Uses self-attention to model initial interactions between all individuals, building a baseline understanding of collective emotional dynamics.
Refined Relation Module (RRM, Fine-Grained): Guided by global scene context, RRM selects key individuals via a differentiable Gumbel-Softmax sampling strategy (with replacement). This avoids rigid “top-k” selection, ensuring we prioritize people who truly drive the group’s emotion while preserving smooth gradient flow for training.
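A minimal sketch of how scene-guided, differentiable key-individual selection with Gumbel-Softmax could look in PyTorch is given below. Scoring individuals by a dot product with the scene vector, the number of draws `k`, and the temperature are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def select_key_individuals(person_feats, scene_feat, k=4, tau=0.5):
    """Differentiable key-individual selection via Gumbel-Softmax (sketch).

    person_feats: (N, D) relation-refined individual features
    scene_feat:   (D,)   global scene context vector
    Returns (k, D) soft-selected features; draws are made with replacement,
    so the same individual may be picked more than once.
    """
    # Scene-guided relevance score (logit) for each individual.
    logits = person_feats @ scene_feat                       # (N,)
    picks = []
    for _ in range(k):
        # One-hot-like soft sample; gradients flow through the softmax relaxation.
        w = F.gumbel_softmax(logits, tau=tau, hard=True)     # (N,)
        picks.append(w @ person_feats)                       # (D,) selected feature
    return torch.stack(picks)                                # (k, D)
```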
3. Multi-Branch Interaction Module (MIM)
To fuse features from the face, object, and scene branches effectively, MIM enhances a standard Transformer encoder with a localized mask layer. This mask constrains attention to neighboring regions, balancing local individual details with global context—ensuring no critical interaction (e.g., two people conversing) is overlooked.
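One way to realize such a localized mask on top of a standard Transformer encoder is sketched below, assuming tokens are ordered so that spatial neighbors are adjacent and using an illustrative window size; the paper's mask construction may differ.

```python
import torch
import torch.nn as nn

def local_attention_mask(num_tokens: int, window: int = 3) -> torch.Tensor:
    """Boolean mask blocking attention outside a local neighborhood.

    True entries are masked out (the convention of nn.TransformerEncoder).
    Assumes tokens are ordered so that spatial neighbors sit next to each other.
    """
    idx = torch.arange(num_tokens)
    dist = (idx[:, None] - idx[None, :]).abs()
    return dist > window                                     # (T, T) bool

# Usage sketch: a standard Transformer encoder constrained by the local mask.
tokens = torch.randn(1, 20, 256)   # (batch, tokens, dim): face/object/scene tokens
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
out = encoder(tokens, mask=local_attention_mask(20))
```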
We validated KR-HRI on three benchmark GER datasets (GAFF2, GAFF3, GroupEmoW) and consistently outperformed state-of-the-art methods:
On GAFF2: Achieved 79.34% average accuracy (AVE) and 79.91% unweighted average recall (UAR) in the Face+Scene setting—outperforming the best prior method by 0.33% (AVE) and 0.9% (UAR).
On GAFF3: Reached 81.17% AVE and 81.29% UAR in the Face+Scene+Object setting, with a 2.12% improvement in negative emotion recognition (a common bottleneck).
On GroupEmoW: Surpassed leading methods by 1.75% (AVE) and 1.82% (UAR), excelling in neutral emotion recognition (90.28% accuracy—5% higher than prior work).
Ablation studies further confirmed the value of each component: HRIM (coarse-to-fine modeling) and MIM (dynamic fusion) together contributed a 4.6% UAR gain over baseline models.
KR-HRI provides a robust framework for GER by prioritizing meaningful social interactions over generic feature aggregation. Its adaptive key role selection and dynamic context fusion can be extended to related tasks—such as crowd behavior analysis, multi-party conversation emotion recognition, and even AI-driven social robotics.
We are excited to share this work with the community and look forward to exploring its broader applications in understanding human collective behavior.
2) Towards a Robust Group-Level Emotion Recognition via Uncertainty-Aware Learning
We are excited to announce that our research paper, “Towards a Robust Group-Level Emotion Recognition via Uncertainty-Aware Learning”, has been officially published in IEEE Transactions on Affective Computing (Volume 16, Number 3, July-September 2025)—a top-tier journal in the fields of affective computing and human behavior analysis. This work addresses long-standing challenges in group-level emotion recognition (GER) by focusing on the inherent uncertainties of real-world scenarios, offering a more robust solution that outperforms existing state-of-the-art methods across key benchmarks.
Group-level emotion recognition, which identifies collective emotional states (such as a joyful festival crowd or a tense protest gathering), plays a vital role in social behavior analysis, public security, and human-robot interaction. However, existing GER methods face two major limitations that hinder their real-world applicability. First, they rely on deterministic feature representations for individuals, ignoring the uncertainties common in unconstrained environments. These uncertainties include crowd congestion, facial occlusion (e.g., one person blocking another's face), and variable lighting, all of which corrupt data quality and lead to unreliable emotional cues. Second, GER tasks only provide group-level labels (e.g., "neutral" for an entire scene), but individuals within the same group often show conflicting emotions (e.g., a smiling person in a neutral crowd). Traditional methods fail to account for this inconsistency, leading to confused and inaccurate group emotion inferences.
To solve these challenges, we developed a three-branch uncertainty-aware learning (UAL) framework that explicitly models uncertainty throughout the GER process, from feature extraction to final emotion prediction. This framework ensures more robust, adaptable, and reliable recognition by turning limitations (uncertainty, noisy data) into strengths.
1. Multi-Source Feature Extraction
We capture complementary emotional cues from three dedicated branches, ensuring we do not rely on a single data source:
Face Branch: Uses MTCNN (a multi-task cascaded convolutional network) to detect faces, then leverages ResNet18 (pre-trained on the MS-Celeb-1M dataset) to extract detailed facial expression features, the most direct indicator of individual emotion.
Object Branch: Employs Faster R-CNN (trained on the MSCOCO dataset) to identify scene objects (e.g., balloons, signs) and uses VGG19 (pre-trained on ImageNet) to process their features. Objects often provide critical context for group emotions (e.g., balloons signaling a positive event).
Scene Branch: Uses VGG19 to extract global scene features (e.g., the setting of a wedding vs. a funeral), which establish the overall emotional tone of the group.
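For readers who want a concrete starting point, the sketch below wires up off-the-shelf stand-ins for the three branches (facenet-pytorch's MTCNN, torchvision's Faster R-CNN and VGG19). The checkpoints and the helper name `extract_branches` are assumptions for illustration, not the paper's exact models or code.

```python
import torch
from PIL import Image
import torchvision.models as tvm
import torchvision.transforms.functional as TF
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from facenet_pytorch import MTCNN   # off-the-shelf MTCNN; stand-in for the paper's detector

# Stand-in backbones: the paper uses a ResNet18 pre-trained on MS-Celeb-1M and a
# Faster R-CNN trained on MSCOCO; public torchvision weights are substituted here.
face_detector = MTCNN(keep_all=True)
object_detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
scene_encoder = tvm.vgg19(weights="IMAGENET1K_V1").eval()

def extract_branches(img: Image.Image):
    """Face boxes, object detections, and a global scene feature for one image (sketch)."""
    x = TF.to_tensor(img)                                    # (3, H, W) in [0, 1]
    with torch.no_grad():
        face_boxes, _ = face_detector.detect(img)            # (num_faces, 4) or None
        objects = object_detector([x])[0]                    # dict: boxes, labels, scores
        scene_feat = scene_encoder.features(x[None]).mean(dim=(2, 3))  # (1, 512) scene cue
    return face_boxes, objects, scene_feat
```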
2. Uncertainty Modeling: Moving Beyond Fixed Representations
The core of our UAL framework is replacing traditional fixed “point-based” individual features with probabilistic Gaussian distribution embeddings. For every face or object, we represent its feature as a Gaussian distribution, where the mean captures the core feature (e.g., a smile’s key visual traits) and the variance quantifies uncertainty (e.g., higher variance for an occluded face).
To ensure the model can learn from this probabilistic representation, we use Monte Carlo sampling (optimized at 15 samples) and a reparameterization trick to generate diverse, stochastic features. This not only captures the natural variability of emotional expressions but also ensures the model generalizes better to new scenarios. We also calculate “uncertainty-sensitive scores” from the variance, which act as adaptive weights during feature aggregation—automatically downweighting individuals with high uncertainty (e.g., heavily occluded faces) to avoid distorting the group emotion prediction.
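The sketch below illustrates this idea with a small PyTorch module that predicts a mean and log-variance per individual, draws reparameterized samples, and pools individuals with variance-derived weights. The layer sizes, the softmax weighting, and the module name are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GaussianEmbedding(nn.Module):
    """Probabilistic individual embedding: mean + variance heads (sketch).

    The mean carries the core expression feature; the (log-)variance quantifies
    uncertainty, e.g. from occlusion or blur.
    """
    def __init__(self, in_dim=512, out_dim=256, num_samples=15):
        super().__init__()
        self.mu_head = nn.Linear(in_dim, out_dim)
        self.logvar_head = nn.Linear(in_dim, out_dim)
        self.num_samples = num_samples            # Monte Carlo samples (15 in the paper)

    def forward(self, feats):                     # feats: (N, in_dim), one row per individual
        mu, logvar = self.mu_head(feats), self.logvar_head(feats)
        std = (0.5 * logvar).exp()
        # Reparameterization trick: stochastic samples with gradients through mu/std.
        eps = torch.randn(self.num_samples, *mu.shape, device=feats.device)
        samples = mu + eps * std                  # (num_samples, N, out_dim)
        # Uncertainty-sensitive score: lower variance -> higher aggregation weight.
        score = (-logvar.mean(dim=-1)).softmax(dim=0)        # (N,)
        group_feat = (score[:, None] * mu).sum(dim=0)        # uncertainty-weighted pooling
        return samples, score, group_feat
```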
3. Image Enhancement: Filtering Noise at the Source
Low-quality face samples (e.g., blurry or unrecognizable faces) often degrade GER performance. We integrate a face quality assessment module (based on SER-FIQ) that computes a quality score for each detected face using pairwise distances between stochastic embeddings. Samples with scores below a threshold (set to 0.3) are filtered out, ensuring only reliable faces enter the feature extraction pipeline—reducing noise at the data level to complement our uncertainty modeling.
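A SER-FIQ-style quality score can be sketched as follows, assuming T stochastic embeddings of the same face (e.g., from different dropout masks) and an illustrative sigmoid scaling; the exact scoring used in SER-FIQ and in the paper may differ.

```python
import torch

def serfiq_style_quality(stochastic_embeds: torch.Tensor) -> torch.Tensor:
    """Face quality score from the spread of stochastic embeddings (sketch).

    stochastic_embeds: (T, D) embeddings of the *same* face under T stochastic
    forward passes. Tight clusters (small pairwise distances) -> high quality.
    """
    e = torch.nn.functional.normalize(stochastic_embeds, dim=-1)
    dists = torch.cdist(e, e)                     # (T, T) pairwise Euclidean distances
    t = e.shape[0]
    mean_dist = dists.sum() / (t * (t - 1))       # average, excluding the zero diagonal
    return 2.0 * torch.sigmoid(-mean_dist)        # maps the spread into (0, 1]

# Faces whose score falls below the 0.3 threshold are discarded before feature extraction.
keep = serfiq_style_quality(torch.randn(15, 256)) >= 0.3
```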
4. Adaptive Fusion: Proportional-Weighted Fusion Strategy (PWFS)
To combine outputs from the three branches effectively, we avoid the rigid, pre-defined weights used in traditional methods. Instead, our PWFS dynamically assigns weights based on each branch’s relative predictive strength for a given input. For example, if the face branch generates a more confident prediction for a scene, it receives a higher weight. This ensures we leverage complementary information from all branches without overfitting, resulting in more accurate final group emotion predictions.
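As a rough illustration, the sketch below weights each branch's class distribution by its relative confidence; using the maximum class probability as the confidence measure is an assumption for illustration, not necessarily the paper's exact PWFS rule.

```python
import torch

def proportional_weighted_fusion(branch_logits):
    """Fuse branch predictions with weights proportional to their confidence (sketch).

    branch_logits: list of (num_classes,) logit vectors from the face, object,
    and scene branches. Each branch's weight is its top class probability
    divided by the sum over branches, so more confident branches contribute more.
    """
    probs = [l.softmax(dim=-1) for l in branch_logits]
    conf = torch.stack([p.max() for p in probs])             # per-branch confidence
    weights = conf / conf.sum()                              # proportional weights
    fused = sum(w * p for w, p in zip(weights, probs))       # weighted class distribution
    return fused, weights

# Example: the face branch is most confident here, so it dominates the fused prediction.
fused, w = proportional_weighted_fusion([torch.tensor([2.0, 0.1, 0.1]),
                                         torch.tensor([0.5, 0.4, 0.3]),
                                         torch.tensor([0.2, 0.2, 0.6])])
```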
We validated our UAL framework on three widely used GER datasets, demonstrating its effectiveness and robustness across different scenarios:
GAFF2 Dataset: Achieved 79.32% average recall, 79.52% unweighted average recall (UAR), and 79.23% F-measure. Notably, we outperformed key state-of-the-art methods (such as Fujii et al.’s 2020 hierarchical framework) in neutral emotion recognition by 2.67% UAR, a common bottleneck in GER.
GAFF3 Dataset: Surpassed the current best method (Fujii et al., 2020) with improvements of 0.96% in recall, 0.49% in precision, and 0.81% in F-measure. This success stems from our ability to model uncertainty, which helps the model handle the larger, more diverse data in GAFF3.
MultiEmoVA Dataset: Even with its smaller sample size (270 images) and class imbalance, our method achieved 61.22% UAR and 60.77% F-measure—outperforming the previous best method (Huang et al., 2022) by 6.82% UAR and Mou et al.’s 2015 baseline by 21.26% UAR.
Ablation studies further confirmed the value of each component: the UAL module and image enhancement module together contributed a 4.6% UAR gain over baseline models, while our PWFS outperformed fixed-weight fusion strategies by 2.4% in average recall on GAFF2.
Our UAL framework provides a new paradigm for GER by treating uncertainty as a manageable, informative signal rather than noise to ignore. Its adaptive uncertainty modeling and dynamic fusion can be extended to related tasks, such as crowd behavior analysis, multi-party conversation emotion recognition, and AI-driven social robotics, helping build more reliable, emotionally intelligent systems.