(ICASSP 2022)
Pingchuan Ma, Yujiang Wang, Stavros Petridis, Jie Shen, Maja Pantic
Imperial College London, Meta AI
Several training strategies and temporal models have recently been proposed for isolated-word lip-reading in a series of independent works. However, the potential of combining the best strategies and investigating the impact of each of them has not been explored. In this paper, we systematically investigate the performance of state-of-the-art data augmentation approaches, temporal models and other training strategies, such as self-distillation and the use of word boundary indicators. Our results show that Time Masking (TM) is the most important augmentation, followed by mixup, and that Densely-Connected Temporal Convolutional Networks (DC-TCN) are the best temporal model for lip-reading of isolated words. Using self-distillation and word boundary indicators is also beneficial, but to a lesser extent. A combination of all of the above methods results in a classification accuracy of 93.4%, an absolute improvement of 4.6% over the current state-of-the-art performance on the LRW dataset. The performance can be further improved to 94.1% by pre-training on additional datasets. An error analysis of the various training strategies reveals that the performance improves by increasing the classification accuracy of hard-to-recognise words.
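The two augmentations highlighted above are straightforward to reproduce. Below is a minimal sketch, not the exact implementation used in the paper, of Time Masking and mixup applied to a video clip tensor; the function names, mask-length bound and Beta parameter are illustrative assumptions.

```python
import torch

def time_mask(clip: torch.Tensor, max_mask_len: int = 10) -> torch.Tensor:
    """Replace a random contiguous span of frames with the clip's mean value.

    clip: (T, H, W) grayscale mouth-region frames.
    """
    T = clip.shape[0]
    mask_len = min(torch.randint(0, max_mask_len + 1, (1,)).item(), T)
    if mask_len == 0:
        return clip
    start = torch.randint(0, T - mask_len + 1, (1,)).item()
    clip = clip.clone()
    clip[start:start + mask_len] = clip.mean()
    return clip

def mixup(clip_a, label_a, clip_b, label_b, alpha: float = 0.4):
    """Blend two clips and their (one-hot or soft) labels with a Beta-sampled weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    clip = lam * clip_a + (1.0 - lam) * clip_b
    label = lam * label_a + (1.0 - lam) * label_b
    return clip, label
```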
The results obtained with the proposed models on the LRW dataset are the following:
Visual model (Conv3d+ResNet-18+DC-TCN): 91.6%
Visual model (Conv3d+ResNet-18+DC-TCN) + Boundary: 93.4%
Visual model (Conv3d+ResNet-18+DC-TCN) + Boundary + Additional training data: 94.1%
[Paper], [Training Code], [Model]
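The "+ Boundary" rows above correspond to the word boundary indicators mentioned in the abstract. One simple way to use such indicators, sketched below purely for illustration (the paper's exact mechanism may differ, and the function name and feature shapes are assumptions), is to append a binary per-frame channel marking the frames inside the target word to the front-end features before the temporal model.

```python
import torch

def add_boundary_indicator(features: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Append a binary per-frame channel that is 1 inside [start, end) and 0 elsewhere.

    features: (T, D) frame-level features from the visual front-end.
    Returns a (T, D + 1) tensor fed to the temporal model.
    """
    T = features.shape[0]
    indicator = torch.zeros(T, 1, dtype=features.dtype)
    indicator[start:end] = 1.0
    return torch.cat([features, indicator], dim=-1)

# Example: 29 frames of 512-d features, target word spanning frames 9..20.
feats_with_boundary = add_boundary_indicator(torch.randn(29, 512), start=9, end=20)  # (29, 513)
```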
(ICASSP 2021)
Pingchuan Ma, Brais Martinez, Stavros Petridis, Maja Pantic
Imperial College London, Samsung AI Research Center
Lipreading has witnessed a lot of progress due to the resurgence of neural networks. Recent work has placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization. However, there is still a significant gap between the current methodologies and the requirements for an effective deployment of lipreading in practical scenarios. In this work, we propose a series of innovations that significantly bridge that gap: first, we raise the state-of-the-art performance on two standard datasets, LRW and LRW-1000, by a wide margin through careful optimization. Secondly, we propose a series of architectural changes, including a novel depthwise-separable TCN head, that slashes the computational cost to a fraction of that of the (already quite efficient) original model. Thirdly, we show that knowledge distillation is a very effective tool for recovering the performance of the lightweight models. This results in a range of models with different accuracy-efficiency trade-offs. Notably, our most promising lightweight models are on par with the current state-of-the-art while showing a reduction of almost 10x in computational cost, which we hope will enable the deployment of lipreading models in practical applications.
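A depthwise-separable temporal convolution, the building block behind the DS-(MS-)TCN heads listed below, can be sketched as follows. This is an illustrative PyTorch block rather than the exact layer used in the paper; the channel sizes and the BatchNorm/ReLU placement are assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableTemporalConv(nn.Module):
    """A per-channel (depthwise) temporal convolution followed by a 1x1 (pointwise)
    channel-mixing convolution, which is far cheaper than a dense temporal conv."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv1d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm1d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Example: a batch of 8 sequences, 512 channels, 29 frames.
y = DepthwiseSeparableTemporalConv(512, 256)(torch.randn(8, 512, 29))  # (8, 256, 29)
```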
The results obtained with the proposed models on the LRW dataset are the following:
Visual model (Conv3d+ResNet-18+MS-TCN): 88.5%
Lightweight visual model (Conv3d+ShuffleNet v2+DS-MS-TCN): 85.3% (9.3M Params, 1.27G FLOPs*)
Lightweight visual model (Conv3d+ShuffleNet v2+TCN): 79.9% (2.9M Params, 0.66G FLOPs*)
The results obtained with the proposed models on the LRW-1000 dataset are the following:
Visual model (Conv3d+ResNet-18+MS-TCN): 46.6%
Lightweight visual model (Conv3d+ShuffleNet v2+TCN): 41.4% (4.0M Params, 1.81G FLOPs**)
Lightweight visual model (Conv3d+ShuffleNet v2+DS-TCN): 40.2% (1.6M Params, 0.84G FLOPs**)
* FLOPs for the LRW dataset are computed on a 29-frame sequence with a spatial size of 88 × 88 pixels.
** FLOPs for the LRW-1000 dataset are computed on a 29-frame sequence with a spatial size of 112 × 112 pixels.
[Paper], [Training Code], [Model]
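The knowledge distillation objective mentioned above is typically a blend of cross-entropy on the ground truth and a temperature-scaled KL term pulling the student towards the teacher. A minimal sketch follows; the temperature, mixing weight and function name are illustrative choices and may differ from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend cross-entropy on the labels with a KL term that matches the student's
    softened predictions to the teacher's softened predictions."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kd

# Example with a 500-way classifier (the LRW vocabulary size) and a batch of 8.
loss = distillation_loss(torch.randn(8, 500), torch.randn(8, 500),
                         torch.randint(0, 500, (8,)))
```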
(ICASSP 2020)
Brais Martinez, Pingchuan Ma, Stavros Petridis, Maja Pantic
Samsung AI Research Center, Imperial College London
Lip-reading has attracted a lot of research attention lately thanks to advances in deep learning. The current state-of-the-art model for recognition of isolated words in-the-wild consists of a residual network and Bidirectional Gated Recurrent Unit (BGRU) layers. In this work, we address the limitations of this model and propose changes which further improve its performance. Firstly, the BGRU layers are replaced with Temporal Convolutional Networks (TCN). Secondly, we greatly simplify the training procedure, which allows us to train the model in a single stage. Thirdly, we show that the current state-of-the-art methodology produces models that do not generalize well to variations in sequence length, and we address this issue by proposing a variable-length augmentation. We present results on the largest publicly-available datasets for isolated word recognition in English and Mandarin, LRW and LRW-1000, respectively. Our proposed model yields an absolute improvement of 1.2% and 3.2%, respectively, on these datasets, setting a new state-of-the-art performance.
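The variable-length augmentation described above amounts to training on randomly cropped temporal sub-sequences rather than always on the full fixed-length clip. A minimal sketch of a random temporal crop is shown below; the minimum length and function name are illustrative assumptions, not the paper's exact settings.

```python
import torch

def variable_length_crop(clip: torch.Tensor, min_len: int = 16) -> torch.Tensor:
    """Keep a random contiguous sub-sequence of the clip so the model sees
    sequences of varying length during training.

    clip: (T, H, W) grayscale mouth-region frames.
    """
    T = clip.shape[0]
    if T <= min_len:
        return clip
    new_len = torch.randint(min_len, T + 1, (1,)).item()
    start = torch.randint(0, T - new_len + 1, (1,)).item()
    return clip[start:start + new_len]
```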
The results obtained with the proposed model on the LRW dataset are the following:
Audio model (End-to-End): 98.5%
Visual model (End-to-End): 85.3%
Audiovisual model (End-to-End): 99.0%
The result obtained with the proposed model on the LRW-1000 dataset is the following:
Visual model (End-to-End): 41.4%
(ICASSP 2018)
Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Feipeng Cai, Georgios Tzimiropoulos, Maja Pantic
Imperial College London, University of Nottingham
This is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW). The model consists of two streams, one for each modality, which extract features directly from mouth regions and raw waveforms. The temporal dynamics in each stream/modality are modeled by a 2-layer BGRU, and the fusion of multiple streams/modalities takes place via another 2-layer BGRU. A slight improvement in the classification rate over an end-to-end audio-only and MFCC-based model is reported in clean audio conditions and at low levels of noise. In the presence of high levels of noise, the end-to-end audiovisual model significantly outperforms both audio-only models.
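The overall fusion architecture described above can be sketched as follows. This is an illustrative PyTorch skeleton of the two per-modality BGRUs and the fusion BGRU, not the authors' exact model; the front-ends that produce the per-frame features, the hidden sizes and the use of the last time step for classification are assumptions.

```python
import torch
import torch.nn as nn

class AudiovisualFusion(nn.Module):
    """Two per-modality 2-layer BGRUs whose concatenated outputs are fused by a
    third 2-layer BGRU, followed by a word classifier on the final time step."""

    def __init__(self, video_dim=256, audio_dim=256, hidden=256, num_classes=500):
        super().__init__()
        self.video_gru = nn.GRU(video_dim, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
        self.audio_gru = nn.GRU(audio_dim, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
        self.fusion_gru = nn.GRU(4 * hidden, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, video_feats, audio_feats):
        # video_feats, audio_feats: (batch, time, dim) front-end features,
        # assumed to be aligned to the same frame rate.
        v, _ = self.video_gru(video_feats)   # (B, T, 2*hidden)
        a, _ = self.audio_gru(audio_feats)   # (B, T, 2*hidden)
        fused, _ = self.fusion_gru(torch.cat([v, a], dim=-1))
        return self.classifier(fused[:, -1])  # logits over the word vocabulary

# Example: 2 clips, 29 aligned frames per modality, 256-d features each.
logits = AudiovisualFusion()(torch.randn(2, 29, 256), torch.randn(2, 29, 256))  # (2, 500)
```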
The results obtained with the proposed model on the LRW dataset* are the following:
Audio model (End-to-End): 97.72%
Visual model (End-to-End): 83.39%
Audiovisual model (End-to-End): 98.38%
At the time of publication, this was the state-of-the-art performance for each modality on LRW.
* The mouth-region crop coordinates used for LRW are (x1, y1, x2, y2) = (80, 116, 175, 211).
The results are slightly better than the ones reported in the ICASSP paper due to further fine-tuning of the models.