End-to-End Video-to-Speech Synthesis using Generative Adversarial Networks
Rodrigo Mira 1 Pingchuan Ma 1 Konstantinos Vougioukas 1 Stavros Petridis 1,2 Björn Schuller 1,3 Maja Pantic 1,4
1 Imperial College London
2 Samsung AI Centre Cambridge
3 ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
4 Facebook London
Abstract
Video-to-speech is the task of reconstructing the speech signal from a video of a spoken utterance. Previous approaches have relied on a two-step process where an intermediate representation is inferred from the video and then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on Generative Adversarial Networks which translates spoken video to waveform audio without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder-decoder architecture that generates audio directly from raw video; the generated audio is then fed to a waveform critic and a power critic. The adversarial loss based on these two critics enables the direct synthesis of raw audio waveforms and ensures their realism. In addition, our three comparative losses (L_Power, L_MFCC and L_Perceptual) help establish a direct correspondence between the generated audio and the input video. We show that this model reconstructs speech with remarkable realism on constrained datasets such as GRID, and that it is the first end-to-end model to produce intelligible speech on LRW, which features hundreds of speakers recorded entirely 'in the wild'. We evaluate the generated samples in two scenarios (seen and unseen speakers) using four objective metrics that measure the quality and intelligibility of artificial speech, and show that the proposed approach outperforms all previous works on most metrics on GRID and LRW.
Link to paper: https://arxiv.org/abs/2104.13332
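To make the training objective concrete, below is a minimal sketch of how a generator loss of this form (a Wasserstein-style adversarial term from the two critics plus the three comparative losses) could be assembled in PyTorch. The critic interfaces, the `perceptual_encoder` (standing in for a pretrained speech feature extractor), the transform settings and the loss weights are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the generator objective described in the abstract.
# Critics, `perceptual_encoder`, transform settings and weights are
# placeholders, not the paper's exact configuration.
import torch.nn.functional as F
import torchaudio

mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=25)
power = torchaudio.transforms.Spectrogram(n_fft=512, power=2.0)

def generator_loss(fake_wav, real_wav, wave_critic, power_critic,
                   perceptual_encoder, w_pow=1.0, w_mfcc=1.0, w_perc=1.0):
    # Adversarial terms: the generator tries to maximise the critics' scores
    # on its own output (the waveform critic sees raw audio, the power critic
    # sees the power spectrogram).
    adv = -wave_critic(fake_wav).mean() - power_critic(power(fake_wav)).mean()
    # Comparative terms: match the generated audio to the ground truth in the
    # power-spectrogram, MFCC and perceptual-feature domains.
    l_pow = F.l1_loss(power(fake_wav), power(real_wav))
    l_mfcc = F.l1_loss(mfcc(fake_wav), mfcc(real_wav))
    l_perc = F.l1_loss(perceptual_encoder(fake_wav),
                       perceptual_encoder(real_wav))
    return adv + w_pow * l_pow + w_mfcc * l_mfcc + w_perc * l_perc
```

The split into a waveform critic and a power critic mirrors the abstract: one critic judges realism in the raw time domain, the other in the spectral domain, while the comparative losses tie the output to the specific input video rather than to realistic speech in general.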
File name lists (train/val/test) for the datasets used in the paper:
Test Samples
(If you require full test samples for comparison, please contact rs2517(at)ic.ac.uk)
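For reference, comparisons against these samples are typically scored with objective quality and intelligibility metrics. The sketch below is illustrative only: this page does not list the paper's exact four metrics, so PESQ and STOI/ESTOI are shown as standard choices, computed with the third-party `pesq` and `pystoi` packages; the file names are placeholders.

```python
# Hedged example: scoring a generated waveform against its reference with
# PESQ (quality) and STOI/ESTOI (intelligibility). File paths are placeholders.
import soundfile as sf
from pesq import pesq
from pystoi import stoi

ref, fs = sf.read("real_audio.wav")       # ground-truth speech
deg, _ = sf.read("generated_audio.wav")   # synthesised speech, same rate

n = min(len(ref), len(deg))               # align lengths before scoring
ref, deg = ref[:n], deg[:n]

print("PESQ :", pesq(fs, ref, deg, "wb"))            # wide-band mode needs 16 kHz
print("STOI :", stoi(ref, deg, fs, extended=False))
print("ESTOI:", stoi(ref, deg, fs, extended=True))
```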
Seen speaker GRID (speakers 1, 2, 4 and 29)
Real audio
Example spectrogram (first sample in the video)
Full model
Example spectrogram (first sample in the video)
w/o Perceptual loss
Example spectrogram (first sample in the video)
w/o Power loss
Example spectrogram (first sample in the video)
w/o MFCC loss
Example spectrogram (first sample in the video)
w/o Perceptual loss, w/o Power loss
Example spectrogram (first sample in the video)
w/o Perceptual loss, w/o MFCC loss
Example spectrogram (first sample in the video)
w/o MFCC, w/o Power loss
Example spectrogram (first sample in the video)
w/o Waveform critic
Example spectrogram (first sample in the video)
w/o Power critic
Example spectrogram (first sample in the video)
w/o Waveform critic, w/o Power critic
Example spectrogram (first sample in the video)
Lip2AudSpec (Akbari et al. 2018)
Example spectrogram (first sample in the video)
GAN-based (Vougioukas et al. 2019)
Example spectrogram (first sample in the video)
Vocoder-based (Michelsanti et al. 2020)
Example spectrogram (first sample in the video)
Unseen speaker GRID (all speakers)
Real audio
Example spectrogram (first sample in the video)
Full model
Example spectrogram (first sample in the video)
GAN-based (Vougioukas et al. 2019)
Example spectrogram (first sample in the video)
Vocoder-based (Michelsanti et al. 2020)
Example spectrogram (first sample in the video)
LRW (Lip Reading in the Wild, full dataset)
Real audio
Example spectrogram (first sample in the video)
Full model
Example spectrogram (first sample in the video)
Seen speaker TCD-TIMIT (3 lipspeakers)
Real audio
Example spectrogram (first sample in the video)
Full model
Example spectrogram (first sample in the video)
Silent video experiment
Example spectrogram (first sample in the video)