RIDGE
Rule-Infused Deep Learning for Realistic Co-Speech Gesture Generation
Ghazanfar Ali Hwangyoun Kim Jae-In Hwang
Korea Institute of Science and Technology
Abstract
Co-speech gestures are essential for natural human communication, yet existing synthesis methods fall short of delivering semantically aligned and contextually appropriate motions. In this paper, we present RIDGE, a hybrid system that combines rule-based and deep-learning approaches to generate realistic gestures for virtual avatars and human-computer interaction. RIDGE employs a high-fidelity rule base, generated from motion-capture data with the assistance of large language models, to select reliable gesture mappings. When a high-confidence match is not available, a contrastively trained deep-learning model steps in to produce semantically appropriate gestures. Evaluated with a novel Gesture Cluster Affinity (GCA) metric, our system outperforms existing baselines, achieving a GCA score of 0.73, compared with 0.60 for a rule-based baseline and 0.52 for an end-to-end model, while ground truth scores 0.90. Detailed analyses of the system architecture, data preprocessing, and evaluation methodology demonstrate RIDGE's potential to enhance gesture synthesis.
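The hybrid dispatch described above — try the rule base first, fall back to the deep model when no mapping clears a confidence threshold — can be sketched roughly as follows. All names here (`lookup_rule`, `GestureModel`, `THRESHOLD`) and the example rule entries are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of RIDGE-style hybrid gesture selection.
# Hypothetical names and threshold; for illustration only.

THRESHOLD = 0.8  # illustrative confidence cutoff, not from the paper

def lookup_rule(phrase, rule_base):
    """Return (gesture, confidence) for the best-scoring rule match."""
    candidates = rule_base.get(phrase, [(None, 0.0)])
    return max(candidates, key=lambda g: g[1])

class GestureModel:
    """Stand-in for the contrastively trained deep model."""
    def generate(self, phrase):
        return f"generated_gesture_for:{phrase}"

def synthesize_gesture(phrase, rule_base, model):
    gesture, conf = lookup_rule(phrase, rule_base)
    if gesture is not None and conf >= THRESHOLD:
        return gesture              # high-confidence rule mapping
    return model.generate(phrase)   # deep-learning fallback

# Toy rule base: each phrase maps to (gesture, confidence) pairs.
rules = {"hello": [("wave", 0.95)], "maybe": [("shrug", 0.4)]}
model = GestureModel()
print(synthesize_gesture("hello", rules, model))  # rule path: "wave"
print(synthesize_gesture("maybe", rules, model))  # low confidence: model output
```

The key design point the abstract implies is that the rule path is preferred when reliable, so the learned model only handles inputs the rule base cannot cover confidently.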