End-to-End Video-to-Speech Synthesis using Generative Adversarial Networks
Rodrigo Mira 1 Pingchuan Ma 1 Konstantinos Vougioukas 1 Stavros Petridis 1,2 Björn Schuller 1,3 Maja Pantic 1,4
1 Imperial College London
2 Samsung AI Centre Cambridge
3 ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
4 Facebook London
Abstract
Video-to-speech is the task of reconstructing the speech signal from a video of a spoken utterance. Previous approaches have relied on a two-step process where an intermediate representation is inferred from the video and then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on Generative Adversarial Networks which translates spoken video to waveform audio without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder-decoder architecture that generates audio directly from raw video; the generated audio is then fed to a waveform critic and a power critic. The adversarial loss based on these two critics enables the direct synthesis of raw audio waveforms and ensures their realism. In addition, our three comparative losses (L_Power, L_MFCC and L_Perceptual) help establish a direct correspondence between the generated audio and the input video. We show that this model reconstructs speech with remarkable realism on constrained datasets such as GRID, and that it is the first end-to-end model to produce intelligible speech on LRW, which features hundreds of speakers recorded entirely 'in the wild'. We evaluate the generated samples in two scenarios (seen and unseen speakers) using four objective metrics that measure the quality and intelligibility of artificial speech, and show that the proposed approach outperforms all previous works on most metrics on GRID and LRW.
Link to paper: https://arxiv.org/abs/2104.13332
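To make the training objective concrete, below is a minimal sketch of how a generator loss of this form (a Wasserstein-style adversarial term from the two critics plus the three comparative losses) could be assembled in PyTorch. The critic interfaces, the `perceptual_encoder` (standing in for a pretrained speech feature extractor), the transform settings and the loss weights are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the generator objective described in the abstract.
# Critics, `perceptual_encoder`, transform settings and weights are
# placeholders, not the paper's exact configuration.
import torch.nn.functional as F
import torchaudio

mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=25)
power = torchaudio.transforms.Spectrogram(n_fft=512, power=2.0)

def generator_loss(fake_wav, real_wav, wave_critic, power_critic,
                   perceptual_encoder, w_pow=1.0, w_mfcc=1.0, w_perc=1.0):
    # Adversarial terms: the generator tries to maximise the critics' scores
    # on its own output (the waveform critic sees raw audio, the power critic
    # sees the power spectrogram).
    adv = -wave_critic(fake_wav).mean() - power_critic(power(fake_wav)).mean()
    # Comparative terms: match the generated audio to the ground truth in the
    # power-spectrogram, MFCC and perceptual-feature domains.
    l_pow = F.l1_loss(power(fake_wav), power(real_wav))
    l_mfcc = F.l1_loss(mfcc(fake_wav), mfcc(real_wav))
    l_perc = F.l1_loss(perceptual_encoder(fake_wav),
                       perceptual_encoder(real_wav))
    return adv + w_pow * l_pow + w_mfcc * l_mfcc + w_perc * l_perc
```

The split into a waveform critic and a power critic mirrors the abstract: one critic judges realism in the raw time domain, the other in the spectral domain, while the comparative losses tie the output to the specific input video rather than to realistic speech in general.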
File name lists (train/val/test) for the datasets used in the paper:
Test Samples
(If you require full test samples for comparison, please contact rs2517(at)ic.ac.uk)
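For reference, comparisons against these samples are typically scored with objective quality and intelligibility metrics. The sketch below is illustrative only: this page does not list the paper's exact four metrics, so PESQ and STOI/ESTOI are shown as standard choices, computed with the third-party `pesq` and `pystoi` packages; the file names are placeholders.

```python
# Hedged example: scoring a generated waveform against its reference with
# PESQ (quality) and STOI/ESTOI (intelligibility). File paths are placeholders.
import soundfile as sf
from pesq import pesq
from pystoi import stoi

ref, fs = sf.read("real_audio.wav")       # ground-truth speech
deg, _ = sf.read("generated_audio.wav")   # synthesised speech, same rate

n = min(len(ref), len(deg))               # align lengths before scoring
ref, deg = ref[:n], deg[:n]

print("PESQ :", pesq(fs, ref, deg, "wb"))            # wide-band mode needs 16 kHz
print("STOI :", stoi(ref, deg, fs, extended=False))
print("ESTOI:", stoi(ref, deg, fs, extended=True))
```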
Seen speaker GRID (speakers 1, 2, 4 and 29)
Real audio
Example spectrogram (first sample in the video)
Full model
Example spectrogram (first sample in the video)
w/o Perceptual loss
Example spectrogram (first sample in the video)
w/o Power loss
Example spectrogram (first sample in the video)
w/o MFCC loss
Example spectrogram (first sample in the video)
w/o Perceptual loss, w/o Power loss
Example spectrogram (first sample in the video)
w/o Perceptual loss, w/o MFCC loss
Example spectrogram (first sample in the video)
w/o MFCC, w/o Power loss
Example spectrogram (first sample in the video)
w/o Waveform critic
Example spectrogram (first sample in the video)
w/o Power critic
Example spectrogram (first sample in the video)
w/o Waveform critic, w/o Power critic
Example spectrogram (first sample in the video)
Lip2AudSpec (Akbari et al. 2018)
Example spectrogram (first sample in the video)
GAN-based (Vougioukas et al. 2019)
Example spectrogram (first sample in the video)
Vocoder-based (Michelsanti et al. 2020)
Example spectrogram (first sample in the video)
Unseen speaker GRID (all speakers)
Real audio
Example spectrogram (first sample in the video)
Full model
Example spectrogram (first sample in the video)
GAN-based (Vougioukas et al. 2019)
Example spectrogram (first sample in the video)
Vocoder-based (Michelsanti et al. 2020)
Example spectrogram (first sample in the video)
LRW (Lip Reading in the Wild, full dataset)
Real audio
Example spectrogram (first sample in the video)
Full model
Example spectrogram (first sample in the video)
Seen speaker TCD-TIMIT (3 lipspeakers)
Real audio
Example spectrogram (first sample in the video)
Full model
Example spectrogram (first sample in the video)
Silent video experiment
Example spectrogram (first sample in the video)