Zerrin Yumak, Associate Professor, Utrecht University, NL
Biography: Dr. Zerrin Yumak is an Associate Professor in the Department of Information and Computing Sciences at Utrecht University, where she also directs the Motion Capture and Virtual Reality Lab. In addition, she serves as one of the program coordinators for the Master’s in Game and Media Technology. Her research centers on 3D digital humans, with a particular focus on their capacity for social interaction—a field she has been active in for nearly two decades. She obtained her PhD from the University of Geneva in Switzerland and held research positions at both the Swiss Federal Institute of Technology in Lausanne and Nanyang Technological University in Singapore before settling in the Netherlands in 2015. Her research follows two main lines. The first is AI-driven motion synthesis: using motion capture data and deep learning algorithms to automatically generate facial expressions and gestures for digital humans in games and XR experiences. The second line goes a step further—exploring how digital humans engage in real-time social interaction and evaluating them from a perceptual standpoint. Her publications have appeared in conferences and journals such as ICCV, Computer Graphics Forum, and Computers & Graphics, and have won Best Paper Awards. She is a Program Chair and Steering Committee member of the ACM Intelligent Virtual Agents conference and a co-organizer of the MASSXR Workshop@IEEE VR on Multi-modal Affective and Social Behavior Analysis and Synthesis in Extended Reality. She is also an Advisory Board Member of the Creative Industries Immersive Impact Coalition (CIIIC) in the Netherlands. She has been an invited speaker at several conferences, and her work has also appeared in international media.
Title: From Realistic to Relatable: Teaching Digital Humans to Communicate
Abstract: With recent advancements in computer graphics, 3D digital humans have achieved an impressive level of visual realism. They are increasingly being integrated into diverse applications, including video games, customer service and finance chatbots, educational and healthcare simulations, remote communication, and social extended reality (XR). However, their ability to interact and move naturally within social contexts remains limited. As humans, we are highly attuned to non-verbal behaviors in emotional and social interactions. For digital humans to engage with us more naturally, they must be equipped with non-verbal communication skills such as facial expressions, gestures, and gaze. As they are deployed in more interactive environments, the demand to generate their behavior automatically and in real time is growing. Yet, capturing and synthesizing the nuanced, individual nature of non-verbal behaviors remains a significant challenge, hindered by limitations in data availability, algorithmic complexity, and evaluation methodologies. In this talk, I will explore how AI and deep learning techniques—particularly those leveraging motion capture technology—can be used to model and generate the non-verbal behaviors of digital humans. I will present the state of the art, highlight our recent research, and offer a critical examination of current evaluation practices in non-verbal behavior synthesis.
Siyang Song, Lecturer, University of Exeter, UK
Biography: Siyang Song is a Lecturer (Assistant Professor) at the University of Exeter. He received his PhD in the Computer Vision Lab and Horizon Center for Doctoral Training at the University of Nottingham, UK. His research interests lie in automatic human behaviour understanding, including facial expression, personality, and depression assessment, as well as automatic human behaviour generation, such as facial reaction generation. Since 2022, Siyang has led a new research topic called Multiple Appropriate Facial Reaction Generation. He has published more than 60 papers in top-tier journals and conferences such as AAAI, ACM Multimedia, CVPR, ECCV, ICCV, IJCAI, NeurIPS, IEEE Trans. on Affective Computing, IEEE Trans. on Image Processing, and IEEE Trans. on Robotics, and has chaired more than 10 international challenges/workshops such as REACT 2023-2025, AVEC 2019, and AHRI 2022-2024.
Title: Multiple Appropriate Facial Reaction Generation: Concept, Methodology, Dataset and Challenges
Abstract: Human behavioural reactions are stimulated by context: people process a received stimulus and produce an appropriate behavioural reaction. This implies that, in a specific context and for a given input stimulus, an individual can react differently according to their internal state and other contextual factors. Analogously, in dyadic interactions humans communicate using verbal and non-verbal cues, and multiple non-verbal facial reactions might be appropriate responses to a specific speaker behaviour. This talk will start by defining the Multiple Appropriate Facial Reaction Generation (MAFRG) concept. It will then introduce a set of objective metrics for evaluating the appropriateness of the generated reactions, the newly established MARS dataset, and recently developed methodologies, and finally discuss the challenges and potential opportunities in this research direction.
Claudio Ferrari, Assistant Professor, University of Siena, IT
Short Bio: Claudio Ferrari is currently a tenure-track assistant professor at the Department of Information Engineering and Mathematics of the University of Siena. Previously, he was an assistant professor at the University of Parma and a research fellow at the University of Florence. He has also been a visiting research scholar at the University of Southern California (USC). He actively conducts research on human behavior analysis and generation, spanning 3D/4D face reconstruction and animation, facial recognition and editing, multi-modal generation, and emotion analysis and recognition, and has co-authored more than 40 papers on these topics in renowned conferences and journals. He was co-chair and organizer of several related workshops and tutorials. He also serves as an Associate Editor for the IEEE Trans. on Circuits and Systems for Video Technology and has served as an Area Chair for the ACM International Conference on Multimedia.
Title: Prospects and Challenges in 3D Face Modeling and Animation
Abstract: Representing, modeling, and animating 3D human faces is a long-standing problem in computer vision and machine learning, with a broad range of possible applications ranging from entertainment, graphics, and biometrics to the medical field. Despite decades of study, the complexity of the problem still challenges researchers, and the room for improvement remains large. In this talk, we’ll delve into the current solutions, challenges, and prospects of modeling and animating 3D faces. The talk will introduce state-of-the-art techniques for representing and modeling the geometry of faces, and show how such techniques have been leveraged to develop solutions for 3D/4D animation, with a specific focus on facial expressions and speech generation (Talking Heads). From a different perspective, the talk will also address another thorny problem: how can we accurately measure progress? While properly measuring error is of paramount importance for accurately assessing a method’s performance, current options face several limitations and fall short in different respects.