We propose a temporal GAN capable of producing animated faces using only a still image of a person and an audio clip containing speech. Our method goes directly from raw audio to video, without requiring additional post-processing steps. Our network uses two types of discriminator:
- Frame Discriminator: The Frame Discriminator evaluates individual frames taken from synthetic/real sequences. This drives the Generator to produce frames that are detailed.
- Sequence Discriminator: The Sequence Discriminator evaluates sequence-audio pairs to determine whether they are real or synthetic. This drives the audio and video to be in sync and encourages the generation of facial expressions (e.g. blinks).
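
One way to picture how the two discriminators act on the Generator is the minimal sketch below. It is not our implementation: the names `generator_adversarial_loss`, `frame_disc` and `seq_disc` are hypothetical, the discriminators are stand-in callables that return a probability of "real", and the binary cross-entropy loss, the equal weighting of the two terms and the number of sampled frames are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Optional

Frame = List[float]   # placeholder for one video frame (e.g. flattened pixels)
Audio = List[float]   # placeholder for raw audio samples


def bce(prob: float, target: float) -> float:
    """Binary cross-entropy for a single predicted probability."""
    eps = 1e-7
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def generator_adversarial_loss(
    fake_frames: List[Frame],
    audio: Audio,
    frame_disc: Callable[[Frame], float],
    seq_disc: Callable[[List[Frame], Audio], float],
    n_sampled: int = 2,                     # assumed value, for illustration
    rng: Optional[random.Random] = None,
) -> float:
    """Adversarial loss seen by the Generator for one synthetic sequence."""
    if not fake_frames:
        raise ValueError("fake_frames must contain at least one frame")
    rng = rng or random.Random(0)

    # Frame Discriminator: judges individual frames taken from the sequence.
    sampled = rng.sample(fake_frames, k=min(n_sampled, len(fake_frames)))
    frame_loss = sum(bce(frame_disc(f), 1.0) for f in sampled) / len(sampled)

    # Sequence Discriminator: judges the whole sequence paired with its audio.
    seq_loss = bce(seq_disc(fake_frames, audio), 1.0)

    # Equal weighting of the two terms is an assumption of this sketch.
    return frame_loss + seq_loss
```

The Generator is rewarded (target `1.0`) when the discriminators judge its output as real, so the first term pushes towards detailed frames and the second towards audio-visual sync and natural expressions over the whole sequence.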
The videos generated using our method exhibit head movements and have natural facial expressions such as frowns and blinks.
Faces generated using our method exhibit realistic facial expressions such as blinks, brow movements (e.g. frowns, lowered or raised eyebrows) and head movements (e.g. slight turns, nods). The following videos contain some characteristic examples of such expressions.