Image & Vision

Acoustic & Speech

Image & Vision

Natural Language Processing

Image & Vision

Investigates Image data fusion techniques and video analytics that combine image and track data from multiple sensors to achieve improved accuracies and more specific inferences than could be achieved by using a single sensor alone. Our aim is to explore the state-of-the-art image processing and video analytics algorithms for achieving effective enhancement, detection, tracking, and video summarization as in:

Multi-Modal Voice Activity

Multi-Modal Voice Activity Detection Demo

1. Introduction

A. Voice Activity Detection (VAD) is an important step in a voice command recognition system such as speech interface based smart TV and voice command car navigation system. For such a system, using only acoustic information, however, may result performance degradation in noisy environments. Some recent studies have focused on improving VAD performance by using additional information such as video signal However many methods typically suffer difficulties in dealing with illumination change or global motion of image frames. To mitigate these problems, a Local Variance Histogram (LVH) as a statistical measure of lip motion change is proposed here.

2. Main Algorithm and Principle

A. An overview of the proposed algorithm is shown in Fig. 1. First, a whole face is detected, and the eyes are located from the detected face. Positions of the eyes are used to localize the lip region. If it is successful in detecting the lip region, the associated LVH is calculated.

Fig 1. Overview of our VVAD algorithm

B. The measure of Local Variance is as follows:

Fig 2. Local Variance Histogram

C. By observing that an open mouth consists of separated lips and teeth regions exhibiting high intensity values and the gap in-between exhibiting low intensity values, it is hypothesized that LVH of an open mouth would result larger values of histogram in high variance bins. An example of a closed and opened mouth underscoring the observation is shown in Fig. 3. Consequently, the number of pixels with high value of LVH is used as a visual cue for detecting speech as follows:

Fig 3. Example of Local Variance Histogram

D. Consequently, the number of pixels with high value of LVH is used as a visual cue for detecting speech as follows:

Fig 4. Plot of local variance histogram and feature extraction result

3. Demo

Page updated

Report abuse