End-to-End Video-to-Speech Synthesis using Generative Adversarial Networks


Rodrigo Mira 1 Pingchuan Ma 1 Konstantinos Vougioukas 1 Stavros Petridis 1,2 Björn Schuller 1,3 Maja Pantic 1,4

1 Imperial College London

2 Samsung AI Centre Cambridge

3 ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany

4 Facebook London



Abstract

Video-to-speech is the task of reconstructing the speech signal from a video of a spoken utterance. Previous approaches have relied on a two-step process in which an intermediate representation is inferred from the video and then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on Generative Adversarial Networks which translates spoken video to waveform audio without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder-decoder architecture that generates raw audio from video; the generated audio is then fed to a waveform critic and a power critic. The adversarial loss based on these two critics enables the direct synthesis of raw audio waveforms and ensures their realism. In addition, three comparative losses (L_Power, L_MFCC and L_Perceptual) help establish a direct correspondence between the generated audio and the input video. We show that this model reconstructs speech with remarkable realism on constrained datasets such as GRID, and that it is the first end-to-end model to produce intelligible speech on LRW (Lip Reading in the Wild), which features hundreds of speakers recorded entirely 'in the wild'. We evaluate the generated samples in two scenarios (seen and unseen speakers) using four objective metrics which measure the quality and intelligibility of the synthesized speech, and show that the proposed approach outperforms all previous works on most metrics for GRID and LRW.
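The sketch below illustrates how the training objective described above could be composed: an adversarial term from the waveform critic and the power critic, plus the three comparative losses (L_Power on spectrograms, L_MFCC on MFCCs, L_Perceptual on embeddings from a pretrained speech encoder). This is a minimal illustration assuming PyTorch/torchaudio; the critic and encoder modules, loss weights, and hinge-style adversarial formulation are placeholders, not the exact implementation used in the paper.

```python
import torch
import torch.nn.functional as F
import torchaudio

SAMPLE_RATE = 16_000  # assumed audio sample rate

# Feature extractors for the comparative losses.
spectrogram = torchaudio.transforms.Spectrogram(n_fft=512, power=2.0)
mfcc = torchaudio.transforms.MFCC(sample_rate=SAMPLE_RATE, n_mfcc=13)


def generator_loss(fake_wav, real_wav, waveform_critic, power_critic,
                   perceptual_encoder,
                   w_adv=1.0, w_pow=1.0, w_mfcc=1.0, w_per=1.0):
    """Combine adversarial and comparative terms for one batch.

    fake_wav, real_wav: (batch, samples) waveforms. The critic and encoder
    arguments are hypothetical nn.Module callables standing in for the
    paper's waveform critic, power critic, and pretrained perceptual encoder.
    """
    fake_spec = spectrogram(fake_wav)
    real_spec = spectrogram(real_wav)

    # Adversarial term: push both critics to rate the generated audio as real.
    adv = -(waveform_critic(fake_wav).mean() + power_critic(fake_spec).mean())

    # L_Power: distance between power spectrograms (log scale for stability).
    l_power = F.l1_loss(torch.log1p(fake_spec), torch.log1p(real_spec))

    # L_MFCC: distance between MFCC sequences.
    l_mfcc = F.l1_loss(mfcc(fake_wav), mfcc(real_wav))

    # L_Perceptual: distance in the embedding space of a pretrained
    # speech encoder (e.g. a PASE-like model).
    l_per = F.l1_loss(perceptual_encoder(fake_wav),
                      perceptual_encoder(real_wav))

    return w_adv * adv + w_pow * l_power + w_mfcc * l_mfcc + w_per * l_per
```

The ablations listed below (removing individual losses or critics) correspond to dropping the matching terms from a combined objective of this kind.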


File name lists (train/val/test) for the datasets used in the paper:

Test Samples

(If you require full test samples for comparison, please contact rs2517(at)ic.ac.uk)

Seen speaker GRID (speakers 1, 2, 4 and 29)

Real audio

allsamples.mp4

Example spectrogram (first sample in the video)

Full model

allsamples.mp4

Example spectrogram (first sample in the video)

w/o Perceptual loss

allsamples.mp4

Example spectrogram (first sample in the video)

w/o Power loss

allsamples.mp4

Example spectrogram (first sample in the video)

w/o MFCC loss

allsamples.mp4

Example spectrogram (first sample in the video)

w/o Perceptual loss, w/o Power loss

allsamples.mp4

Example spectrogram (first sample in the video)

w/o Perceptual loss, w/o MFCC loss

allsamples.mp4

Example spectrogram (first sample in the video)

w/o MFCC loss, w/o Power loss

allsamples.mp4

Example spectrogram (first sample in the video)

w/o Waveform critic

allsamples.mp4

Example spectrogram (first sample in the video)

w/o Power critic

allsamples.mp4

Example spectrogram (first sample in the video)

w/o Waveform critic, w/o Power critic

allsamples.mp4

Example spectrogram (first sample in the video)

Lip2Audspec (Akbari et al. 2018)

allsamples.mp4

Example spectrogram (first sample in the video)

GAN-based (Vougioukas et al. 2019)

allsamples.mp4

Example spectrogram (first sample in the video)

Vocoder-based (Michelsanti et al. 2020)

allsamples.mp4

Example spectrogram (first sample in the video)

Unseen speaker GRID (all speakers)

Real audio

allsamples.mp4

Example spectrogram (first sample in the video)

Full model

allsamples.mp4

Example spectrogram (first sample in the video)

GAN-based (Vougioukas et al. 2019)

allsamples.mp4

Example spectrogram (first sample in the video)

Vocoder-based (Michelsanti et al. 2020)

allsamples.mp4

Example spectrogram (first sample in the video)

LRW (Lip Reading in the Wild, full dataset)

Real audio

allsamples.mp4

Example spectrogram (first sample in the video)

Full model

allsamples.mp4

Example spectrogram (first sample in the video)

Seen speaker TCD-TIMIT (3 lipspeakers)

Real audio

allsamples.mp4

Example spectrogram (first sample in the video)

Full model

allsamples.mp4

Example spectrogram (first sample in the video)

Silent video experiment

website_video.mp4

Example spectrogram (first sample in the video)