Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance


Qingcheng Zhao1,2*   Pengyu Long1,2*   Qixuan Zhang1,2  Dafei Qin2,3   Han Liang1   Longwen Zhang1,2   Yingliang Zhang4   Jingyi Yu1  Lan Xu1 

1ShanghaiTech University        2Deemos Technology         3University of Hong Kong     4DGene Digital Technology Co., Ltd. 

* Equal Contribution    † Project Leader    Corresponding Author

Abstract 

The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and abundant, well-annotated multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge through a trilogy. We first introduce the Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder that maps facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. We then use GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This yields the M2F-D dataset, a large, diverse, scan-level co-speech 3D facial animation dataset with well-annotated emotion and style labels. Finally, we propose Media2Face, a diffusion model in the GNPFA latent space for co-speech facial animation generation that accepts rich multi-modality guidance from audio, text, and images. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.


Video

Overview 

We train a geometry VAE to learn a latent space of expression and head pose, disentangling expression from identity. Two vision encoders are trained to extract expression latent codes and head poses from RGB images, which enables us to capture a wide array of 4D data.
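As a rough illustration of this stage, the sketch below pairs a simple expression VAE over vertex offsets with an image encoder that regresses the same latent space plus a head pose. The vertex count, latent size, layer widths, and KL weight are our own assumptions for illustration, not the released GNPFA architecture.

```python
# Hedged sketch of a GNPFA-style stage (PyTorch): an expression VAE plus an
# image encoder regressing the shared latent and a head pose. All dimensions
# and weights below are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision

NUM_VERTS = 5023   # assumed template vertex count
LATENT_DIM = 128   # assumed expression latent size

class ExpressionVAE(nn.Module):
    """Maps identity-neutral expression offsets to a compact latent and back."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(NUM_VERTS * 3, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
        )
        self.to_mu = nn.Linear(256, LATENT_DIM)
        self.to_logvar = nn.Linear(256, LATENT_DIM)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, 1024), nn.ReLU(),
            nn.Linear(1024, NUM_VERTS * 3),
        )

    def forward(self, offsets):                      # offsets: (B, NUM_VERTS, 3)
        h = self.encoder(offsets.flatten(1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon = self.decoder(z).view(-1, NUM_VERTS, 3)
        return recon, mu, logvar

def vae_loss(recon, target, mu, logvar, kl_weight=1e-4):
    rec = torch.mean((recon - target) ** 2)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl_weight * kl

class ImageEncoder(nn.Module):
    """Regresses an expression latent and a 6-DoF head pose from an RGB frame,
    which is how large video collections can be lifted to 4D training data."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18()
        backbone.fc = nn.Linear(backbone.fc.in_features, LATENT_DIM + 6)
        self.backbone = backbone

    def forward(self, image):                        # image: (B, 3, H, W)
        out = self.backbone(image)
        return out[:, :LATENT_DIM], out[:, LATENT_DIM:]
```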

Our model takes audio features and a CLIP latent code as conditions and denoises the noised sequence of expression latent codes together with head poses, i.e., head motion codes. The conditions are randomly masked and attended to by the noisy head motion codes through cross-attention. At inference, we sample head motion codes with DDIM. The expression latent codes are fed to the GNPFA decoder to obtain expression geometry, which is combined with a model template and the head pose parameters to produce the final facial animation.
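The following is a minimal sketch of such a conditional denoiser and its DDIM sampling loop. The transformer layout, the x0-prediction parameterization, the dimensions, and the step schedule are assumptions made for illustration, not details taken from the released model.

```python
# Hedged sketch: a cross-attention denoiser over head motion codes plus a
# deterministic DDIM sampler. Sizes and schedules are assumptions.
import torch
import torch.nn as nn

D_MODEL = 256          # assumed transformer width
MOTION_DIM = 134       # assumed: 128-d expression latent + 6-d head pose
N_STEPS = 50           # assumed number of DDIM steps

class MotionDenoiser(nn.Module):
    """Predicts the clean head-motion code sequence from a noisy one, attending
    to (randomly maskable) audio / CLIP condition tokens via cross-attention."""
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(MOTION_DIM, D_MODEL)
        self.t_embed = nn.Embedding(1000, D_MODEL)
        self.self_attn = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(D_MODEL, num_heads=4, batch_first=True)
        self.out_proj = nn.Linear(D_MODEL, MOTION_DIM)

    def forward(self, x_t, t, cond, cond_mask=None):
        # cond: (B, L, D_MODEL) audio + CLIP tokens; cond_mask drops tokens,
        # mirroring the random condition masking used during training.
        h = self.in_proj(x_t) + self.t_embed(t)[:, None]
        h = self.self_attn(h)
        ctx, _ = self.cross_attn(h, cond, cond, key_padding_mask=cond_mask)
        return self.out_proj(h + ctx)            # predicted clean motion codes (x0)

@torch.no_grad()
def ddim_sample(model, cond, seq_len, alphas_cum):
    """Deterministic DDIM sampling (eta = 0) over a coarse step schedule."""
    x = torch.randn(1, seq_len, MOTION_DIM)
    steps = torch.linspace(len(alphas_cum) - 1, 0, N_STEPS).long()
    for i, t in enumerate(steps):
        a_t = alphas_cum[t]
        a_prev = alphas_cum[steps[i + 1]] if i + 1 < len(steps) else torch.tensor(1.0)
        x0 = model(x, t.expand(1), cond)
        eps = (x - a_t.sqrt() * x0) / (1 - a_t).sqrt()      # implied noise
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # step toward t_prev
    return x
```

With a standard linear beta schedule, `alphas_cum` could be built as `torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), 0)`; the expression part of the returned codes would then be decoded by the GNPFA decoder as described above.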

Results

We generate vivid dialogue scenes (Rows 1 and 2) from scripted textual descriptions. We synthesize stylized facial animations (Rows 3 and 4) from image prompts, which can be emojis or even more abstract images. We also perform emotional singing in French, English, and Japanese (Rows 5-7). For more results, please refer to the supplementary video.

We can fine-tune the generated facial animation (Row 2) by (1) extracting key-frame expression latent codes with our expression encoder (Row 3), or (2) providing per-frame style prompts through CLIP (Row 4; left: happy, right: sad). The intensity and range of control can be adjusted using diffusion in-betweening techniques, as sketched below.
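One way to realize such diffusion in-betweening is to clamp the constrained frames to noised copies of their target latents at every denoising step, so the model only synthesizes the unconstrained frames. The sketch below reuses the `MotionDenoiser` and `alphas_cum` from the previous sketch; the blending strategy is an assumption, not the paper's exact procedure.

```python
# Hedged sketch of keyframe-constrained sampling ("diffusion in-betweening").
import torch

@torch.no_grad()
def ddim_inbetween(model, cond, keyframes, key_mask, alphas_cum, n_steps=50):
    """
    keyframes : (1, T, MOTION_DIM) target motion codes (valid where key_mask is True)
    key_mask  : (1, T, 1) bool, True at constrained frames
    """
    x = torch.randn_like(keyframes)
    steps = torch.linspace(len(alphas_cum) - 1, 0, n_steps).long()
    for i, t in enumerate(steps):
        a_t = alphas_cum[t]
        a_prev = alphas_cum[steps[i + 1]] if i + 1 < len(steps) else torch.tensor(1.0)
        # overwrite constrained frames with a noised copy of their targets
        noised_keys = a_t.sqrt() * keyframes + (1 - a_t).sqrt() * torch.randn_like(keyframes)
        x = torch.where(key_mask, noised_keys, x)
        x0 = model(x, t.expand(1), cond)
        eps = (x - a_t.sqrt() * x0) / (1 - a_t).sqrt()
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return torch.where(key_mask, keyframes, x)
```

Widening or narrowing `key_mask` (or noising the keyframes only up to an intermediate step) is one plausible way to trade off how strongly the constraints dominate the surrounding frames.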

Thanks to GNPFA, we can further generate personalized and nuanced facial meshes that fit various identities across different genders, ages, and ethnicities. Note the differences in facial details among identities, most notably in the wrinkles.
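For illustration, a decoded expression sequence could be retargeted onto several neutral identity templates as follows. The `retarget` helper and the additive-offset composition are hypothetical and reuse the `ExpressionVAE` from the earlier sketch; they are only meant to convey the expression/identity decoupling.

```python
# Hedged sketch: apply one generated expression-latent sequence to many identities.
import torch

def retarget(vae, expr_latents, identity_templates):
    """
    expr_latents       : (T, LATENT_DIM) generated expression codes
    identity_templates : dict name -> (NUM_VERTS, 3) neutral mesh per identity
    returns            : dict name -> (T, NUM_VERTS, 3) animated vertices
    """
    offsets = vae.decoder(expr_latents).view(len(expr_latents), -1, 3)
    return {name: neutral[None] + offsets for name, neutral in identity_templates.items()}
```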

Citation

Notice: You may not copy, reproduce, distribute, publish, display, perform, modify, create derivative works, transmit, or in any way exploit any such content, nor may you distribute any part of this content over any network, including a local area network, sell or offer it for sale, or use such content to construct any kind of database.