Multi-grained Spatio-Temporal Features Perceived Network for Event-based Lip-Reading

[paper]  [dataset]  [code]

Abstract

     Automatic lip-reading (ALR) aims to recognize words from visual information about the speaker's lip movements. In this work, we introduce a novel type of sensing device, the event camera, to the ALR task. Event cameras offer both technical and application advantages over conventional cameras for ALR: higher temporal resolution, less redundant visual information, and lower power consumption. To recognize words from event data, we propose a novel Multi-grained Spatio-Temporal Features Perceived Network (MSTP) that perceives fine-grained spatio-temporal features from microsecond time-resolved event data. Specifically, we design a multi-branch architecture in which branches operating at different frame rates learn spatio-temporal features of different granularities: the branch operating at a low frame rate perceives spatially complete but temporally coarse features, while the branch operating at a high frame rate perceives spatially coarse but temporally fine features. A message flow module integrates the features from the different branches, yielding more discriminative spatio-temporal features. In addition, we present DVS-Lip, the first event-based lip-reading dataset, captured with an event camera. Experimental results demonstrate the superiority of the proposed model over state-of-the-art event-based action recognition models and video-based lip-reading models.

Visualization of event frames at different temporal resolutions (25 FPS and 200 FPS), together with their corresponding feature maps and feature points after dimensionality reduction with t-SNE.
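To make the two-branch idea above concrete, here is a minimal PyTorch sketch, not the authors' released code: the channel widths, the frame-rate ratio, and the temporal alignment and 1x1x1 fusion inside the message flow module are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MessageFlow(nn.Module):
    """Integrates features from the two branches after aligning temporal lengths."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, slow, fast):
        # slow: (B, C, T, H, W); fast: (B, C, k*T, H, W)
        # Align the high-frame-rate branch to the slow branch's temporal length.
        fast_aligned = nn.functional.interpolate(
            fast, size=slow.shape[2:], mode="trilinear", align_corners=False)
        return self.fuse(torch.cat([slow, fast_aligned], dim=1))

class TwoBranchNet(nn.Module):
    def __init__(self, in_ch=2, channels=32, num_words=100):
        super().__init__()
        self.slow = nn.Conv3d(in_ch, channels, 3, padding=1)  # low frame rate input
        self.fast = nn.Conv3d(in_ch, channels, 3, padding=1)  # high frame rate input
        self.mfm = MessageFlow(channels)
        self.head = nn.Linear(channels, num_words)

    def forward(self, x_slow, x_fast):
        s = self.slow(x_slow).relu()   # spatially complete, temporally coarse
        f = self.fast(x_fast).relu()   # spatially coarse, temporally fine
        fused = self.mfm(s, f)         # exchange information between granularities
        pooled = fused.mean(dim=(2, 3, 4))  # global average pooling
        return self.head(pooled)

net = TwoBranchNet()
logits = net(torch.randn(1, 2, 8, 96, 96), torch.randn(1, 2, 64, 96, 96))
print(logits.shape)  # torch.Size([1, 100])
```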

Dataset

Vocabulary

The vocabulary of the DVS-Lip dataset consists of two parts. The first part contains the 25 most easily confused word pairs selected from the vocabulary of the LRW dataset. The second part contains another 50 words randomly selected from the LRW vocabulary. More details can be found in the paper.
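A hypothetical sketch of such a two-part selection, assuming a pairwise confusion matrix over an LRW-style word list is available; the placeholder data and helper names below are not the authors' tooling.

```python
import numpy as np

rng = np.random.default_rng(0)
lrw_words = [f"word{i}" for i in range(500)]   # placeholder LRW vocabulary
conf = rng.random((500, 500))                  # placeholder confusion matrix
conf = (conf + conf.T) / 2                     # symmetrize pairwise confusion
np.fill_diagonal(conf, 0)

# Part 1: the 25 word pairs with the highest mutual confusion.
iu = np.triu_indices(len(lrw_words), k=1)
order = np.argsort(conf[iu])[::-1]
pairs = [(lrw_words[iu[0][k]], lrw_words[iu[1][k]]) for k in order[:25]]
part1 = sorted({w for pair in pairs for w in pair})  # at most 50 distinct words

# Part 2: 50 additional words sampled from the remaining vocabulary.
remaining = [w for w in lrw_words if w not in part1]
part2 = sorted(rng.choice(remaining, size=50, replace=False))

vocab = part1 + list(part2)
print(len(part1), len(part2), len(vocab))
```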

Data Example

Statistics

Framework

Our framework contains three components: 1) a projection from the raw event streams to frame-like representations; 2) a multi-branch network with message flow modules (MFM) between branches; 3) a sequence model that decodes the visual features into words.
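As a sketch of step 1, the snippet below accumulates a raw event stream of (x, y, t, polarity) tuples into frame-like tensors at two temporal resolutions, one per branch. The binning scheme and array shapes are assumptions for illustration; the paper defines the exact representation.

```python
import numpy as np

def events_to_frames(x, y, t, p, num_bins, height=128, width=128):
    """Accumulate events (x, y, t, polarity) into (num_bins, 2, H, W) frames."""
    frames = np.zeros((num_bins, 2, height, width), dtype=np.float32)
    # Map each microsecond timestamp to a temporal bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1)
    bins = np.minimum((t_norm * num_bins).astype(int), num_bins - 1)
    np.add.at(frames, (bins, p.astype(int), y, x), 1.0)
    return frames

# Toy event stream: 10k events over ~0.5 s at microsecond resolution.
n = 10_000
rng = np.random.default_rng(0)
x = rng.integers(0, 128, n)
y = rng.integers(0, 128, n)
t = np.sort(rng.integers(0, 500_000, n))
p = rng.integers(0, 2, n)

low = events_to_frames(x, y, t, p, num_bins=12)    # low frame rate branch input
high = events_to_frames(x, y, t, p, num_bins=96)   # high frame rate branch input
print(low.shape, high.shape)  # (12, 2, 128, 128) (96, 2, 128, 128)
```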

Results

Comparisons with existing event-based models and state-of-the-art video-based models on the DVS-Lip test set. "Temporal bin" denotes the temporal dimension of the input event frames or video clip. Acc1 and Acc2 denote the accuracy on the first and second parts of the test set, respectively, and Acc denotes the accuracy on the entire test set.
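For clarity on how the three metrics relate, a toy calculation with assumed counts: Acc1 and Acc2 are per-part accuracies, and Acc is the accuracy over the pooled test set, i.e. a sample-weighted combination of the two.

```python
# Hypothetical counts, for illustration only.
correct1, total1 = 820, 1000   # part 1 (confused word pairs)
correct2, total2 = 950, 1000   # part 2 (randomly selected words)

acc1 = correct1 / total1
acc2 = correct2 / total2
acc = (correct1 + correct2) / (total1 + total2)  # pooled, sample-weighted
print(f"Acc1={acc1:.1%}  Acc2={acc2:.1%}  Acc={acc:.1%}")
```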