Imperial College London, UK
We propose a temporal GAN capable of producing animated faces using only a still image of a person and an audio clip containing speech. Our method goes directly from raw audio to video, without requiring additional post-processing steps. Our network uses two types of discriminator.
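The audio-plus-still-image to video data flow described above can be sketched in terms of tensor shapes. Everything below is an illustrative assumption, not the authors' implementation: the function names (`generator`, `frame_discriminator`, `sequence_discriminator`), the frame rate and sample rate, and the stand-in "networks" (random perturbations and sigmoids) are hypothetical placeholders that only show which inputs each component consumes and what it returns.

```python
import numpy as np

FPS = 25     # assumed video frame rate
SR = 16000   # assumed audio sample rate

def generator(still_image, audio, rng):
    """Map one identity image + raw audio to a video of frames.

    still_image: (H, W, 3) array; audio: (num_samples,) array.
    Returns video: (T, H, W, 3), with T frames covering the audio clip.
    """
    num_frames = int(len(audio) / SR * FPS)
    # Stand-in for the learned network: repeat the identity frame
    # per time step and perturb it slightly.
    video = np.repeat(still_image[None], num_frames, axis=0)
    return video + 0.01 * rng.standard_normal(video.shape)

def frame_discriminator(frame):
    """Score the realism of a single frame (stand-in: sigmoid of the mean)."""
    return float(1.0 / (1.0 + np.exp(-frame.mean())))

def sequence_discriminator(video, audio):
    """Score the whole clip jointly with its audio (stand-in scalar)."""
    return float(1.0 / (1.0 + np.exp(-(video.mean() + audio.mean()))))

rng = np.random.default_rng(0)
still = rng.random((64, 64, 3))          # one identity image
speech = rng.standard_normal(SR * 2)     # 2 seconds of raw audio

video = generator(still, speech, rng)
per_frame_scores = [frame_discriminator(f) for f in video]
clip_score = sequence_discriminator(video, speech)
```

The point of the sketch is the division of labour: one discriminator judges individual frames in isolation, while the other sees the full generated clip together with the audio, so temporal and audio-visual consistency can be penalised separately from per-frame image quality.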
Videos generated using our method exhibit realistic facial expressions such as blinks and brow movements (e.g. frowns, lowered or raised eyebrows), as well as head movements (e.g. slight turns, nods). The following videos contain some characteristic examples of such expressions.