Abstract
We investigate several ways to improve multiple face tracking in unconstrained videos with adaptive features, especially audio information. Our work is divided into two parts:
1. Implement Zhang et al.'s [1] adaptive learning method to improve face tracking.
2. Implement face detection and tracking in unconstrained video using off-the-shelf tools such as OpenCV and TensorFlow.
Project members:
Zhiyi Li, Contact: zli04@vt.edu
Barnabas Gavin Cangan
Ilya Pozdneev
Introduction/Motivation/Problem Setup
Multiple human face tracking in unconstrained videos is a challenging and interesting problem. It is challenging because people's facial appearance changes drastically across video frames due to pose, illumination, scale, camera motion, makeup, and heavy occlusion [1]. Zhang et al. tackle this problem by learning discriminative, video-specific face features using convolutional neural networks (CNNs). In their strategy, contextual constraints are applied to generate a large number of training samples, which are then used to fine-tune the network so that it learns a measure of semantic face similarity. Their system demonstrates significant performance improvement over existing techniques. Our objective is to replicate their results and then attempt to build our own face tracking system using off-the-shelf tools available to us.
Related Work
Zhang et al. [1] designed and implemented an automatic multiple face tracking pipeline that applies adaptive features pre-trained by CNNs. As shown in Figure 1, their pipeline includes several steps:
1. Pre-training
2. Discovering training samples
3. Learning video-specific features
4. Linking tracklets
Figure 1. Summary of the Zhang et al. algorithm [1].
Proposed Additions
There is some interesting research on combining audio with video. For example, Kılıç et al. [2] present work on audio-assisted robust visual tracking with adaptive particle filtering.
D'Arca et al. [3] present work on robust indoor speaker recognition in a network of audio and video sensors.
Re-running Zhang's Work
We obtained Zhang's source code from the GitHub repository and ran it on the T-ara dataset, following five steps:
1. Mine constraints
2. Learn adaptive discriminative features
3. Extract features
4. Perform hierarchical agglomerative clustering
5. Perform simple multi-face tracking
We were able to run step 1, generating the face detection video and the face tracklets shown below, and ultimately to repeat the whole process.
YouTube video of face detection and face tracklets for the T-ara dataset. The video is generated from a group of images produced in the face detection phase. The code is from Zhang's work [1].
YouTube video of the final face tracking results. The video is generated from the final output images. The code is from Zhang's work [1].
Our Implementation
Face Detection
For face detection we used Haar feature-based cascade classifiers from the OpenCV framework to detect faces in the video. At first we started with a single face detector, but quickly realized that it might not be sufficient. Note the figure below: in the left image only 2 of the 4 faces were detected. This happens because our detector used frontal-face features, and in a sitcom setting actors rarely face the camera directly.
Later we added a second detector using profile features. Note the figure on the right: in a similar scene only 1 face was detected, but this time the frontal face detector (yellow boxes) failed entirely and only the profile detector (green boxes) succeeded. In the end, adding the second detector did improve the results and increased the number of faces we detect. A sketch of the two-detector setup is shown below.
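A minimal sketch of this two-detector setup, assuming the standard cascade files bundled with OpenCV; the `scaleFactor` and `minNeighbors` values below are common defaults, not the exact parameters we used:

```python
import cv2

# The two Haar cascades shipped with opencv-python.
frontal = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
profile = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_profileface.xml")

def detect_faces(frame):
    """Return (x, y, w, h) boxes from both the frontal and profile cascades."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    front = frontal.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    prof = profile.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return list(front) + list(prof)
```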
Face Recognition
The classifier was built using the MobileNet architecture [4] provided as part of the TensorFlow framework, and was then trained to recognize 7 characters from the show. The training dataset was built from a combination of Google Images results and shots from the show. For each cast member, the Google Images results were downloaded and run through the face detector, which extracted the faces as labeled data. Additionally, we manually labeled faces extracted from the video and added them to the Google Images dataset.
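A minimal transfer-learning sketch of such a classifier using tf.keras; the `faces/` directory layout, image size, and hyperparameters are illustrative assumptions, not our exact training setup:

```python
import tensorflow as tf

NUM_CLASSES = 7  # the seven characters from the show

# MobileNet [4] pre-trained on ImageNet, with the classification head removed.
base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the pre-trained features, train only the new head

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNet expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Labeled face crops, one subdirectory per character (assumed layout).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "faces/", image_size=(224, 224), batch_size=32)
model.fit(train_ds, epochs=10)
```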
In the figure below, the left image displays a successful identification of Howard in a 3/4 face pose. However, a few frames earlier the camera could only see Howard's profile, resulting in a classifier failure. We later realized these effects were caused by a lack of diversity in our dataset: most images returned by a Google search are of front-facing, smiling people, which does not help when trying to classify live actors.
Face Tracking
The proposed solution to face tracking in unconstrained video was the following:
1) Face detection using Haar feature-based cascade classifiers
2) Face recognition using a CNN trained on the 7 actors
3) Median Flow optical tracking until failure (failure is usually the result of a scene change or heavy occlusion)
4) After a tracking failure, return to step 1 and repeat
While we did implement Median Flow tracking in one of our early prototypes, it was not used to produce the video below. This was due to the shortcomings of face detection: without reliable face detection there is no data to track.
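For reference, here is a sketch of how a detected face box can be handed to OpenCV's Median Flow tracker until it fails. The video path and initial box are placeholders; the tracker lives under `cv2.legacy` in recent opencv-contrib-python builds (older 3.x builds expose `cv2.TrackerMedianFlow_create` directly):

```python
import cv2

cap = cv2.VideoCapture("episode.mp4")  # placeholder input video
ok, frame = cap.read()

tracker = cv2.legacy.TrackerMedianFlow_create()
bbox = (100, 80, 60, 60)  # illustrative (x, y, w, h) box from the Haar detector
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    ok, bbox = tracker.update(frame)
    if not ok:
        # Tracking failure (typically a scene change or heavy occlusion):
        # fall back to face detection, i.e. return to step 1.
        break
```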
Final implementation algorithm:
1) Parse a video frame
2) Run the frame through the frontal face detector
3) Run the frame through the profile face detector
4) Location-based detection arbitration, merging overlapping frontal and profile detections (see the sketch after this list)
5) Expand each face's bounding box
6) Run each face crop through the recognition CNN
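Steps 4 and 5 are sketched below; the IoU threshold and 20% padding factor are illustrative assumptions, not the exact values we used:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1 = a[0] + a[2], a[1] + a[3]
    bx1, by1 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax1, bx1) - max(a[0], b[0]))
    ih = max(0, min(ay1, by1) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def arbitrate(frontal_boxes, profile_boxes, thresh=0.3):
    """Step 4: keep all frontal boxes; add profile boxes overlapping none of them."""
    merged = list(frontal_boxes)
    for pb in profile_boxes:
        if all(iou(pb, fb) < thresh for fb in merged):
            merged.append(pb)
    return merged

def expand_box(box, frame_shape, pad=0.2):
    """Step 5: grow an (x, y, w, h) box by `pad` per side, clipped to the frame."""
    x, y, w, h = box
    fh, fw = frame_shape[:2]
    dx, dy = int(w * pad), int(h * pad)
    x0, y0 = max(0, x - dx), max(0, y - dy)
    return (x0, y0, min(fw, x + w + dx) - x0, min(fh, y + h + dy) - y0)
```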
Results
Qualitative
Quantitative - CNN Training
Validation accuracy = 79.0%
Test accuracy = 81.4%
The plot below displays the progression of validation accuracy (blue) and test accuracy (orange) over the course of 10,000 training iterations.
Conclusion
We have some valuable takeaways from our work on this project, the most important being our appreciation of the difficulties in tackling open problems in computer vision. From our experience with detection, recognition, and tracking algorithms, we have learned that detection is the weakest link of the three, especially in unconstrained sitcom videos, where the actors rarely, if ever, directly face the camera.
References
[1] Shun Zhang, Yihong Gong, Jia-Bin Huang, Jongwoo Lim, Jinjun Wang, Narendra Ahuja and Ming-Hsuan Yang, "Tracking Persons-of-Interest via Adaptive Discriminative Features", ECCV 2016
[2] V. Kılıç, M. Barnard, W. Wang and J. Kittler, "Audio Assisted Robust Visual Tracking With Adaptive Particle Filtering," in IEEE Transactions on Multimedia, vol. 17, no. 2, pp. 186-200, Feb. 2015.
[3] Eleonora D'Arca, Neil M. Robertson, and James R. Hopgood, "Robust indoor speaker recognition in a network of audio and video sensors," Signal Processing, vol. 129, pp. 137-149, 2016, https://doi.org/10.1016/j.sigpro.2016.04.014.
[4] Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).