Abstract
We investigate several ways to improve multiple face tracking in unconstrained videos with adaptive features, especially audio information. Our work is divided into two parts:
1. Implement Zhang et al.'s [1] adaptive learning method to improve face tracking.
2. Implement face detection and tracking in unconstrained video using off-the-shelf tools such as OpenCV and TensorFlow.
Project members:
Zhiyi Li, Contact: zli04@vt.edu
Barnabas Gavin Cangan
Ilya Pozdneev
Introduction/Motivation/Problem Setup
Multiple human face tracking in unconstrained videos is a challenging and interesting problem. It is challenging because people's facial appearance changes drastically across video frames due to pose, illumination, scale, camera motion, makeup, and heavy occlusion [1]. Zhang et al. tackle this problem by learning discriminative, video-specific face features using convolutional neural networks (CNNs). In their strategy, contextual constraints are applied to generate a large number of training samples, which are then used to fine-tune the network so that it learns a measure of semantic face similarity. Their system demonstrates significant performance improvement over existing techniques. Our objective is to replicate their results and then attempt to build our own face tracking system using off-the-shelf tools available to us.
Related Work
Zhang et al. [1] designed and implemented an automatic multiple face tracking pipeline that applies adaptive features pre-trained by CNNs. As shown in Figure 1, their pipeline includes several steps:
1. Pre-training
2. Discovering training samples
3. Learning video-specific features
4. Linking tracklets
Figure 1. Summary of the Zhang et al. algorithm [1].
Proposed Additions
There is some interesting research on combining audio with video. For example, Kılıç et al. [2] present work on audio-assisted robust visual tracking with adaptive particle filtering.
D'Arca et al. [3] present work on robust indoor speaker recognition in a network of audio and video sensors.
Re-running Zhang's Work
We obtained Zhang's source code from the GitHub repository and ran it on the T-ara dataset, following five steps:
1. Mine constraints
2. Learn adaptive discriminative features
3. Extract features
4. Perform hierarchical agglomerative clustering
5. Perform simple multi-face tracking
We were able to run step 1, generating the face detection video and the face tracklets shown below, and ultimately to repeat the whole process.
YouTube video of face detection and face tracklets for the T-ara dataset. The video is generated from a group of images produced in the face detection phase. The code is from Zhang's work [1].
YouTube video of the final face tracking results. The video is generated from the final output images. The code is from Zhang's work [1].
Our Implementation
Face Detection
For face detection we used Haar feature-based cascade classifiers from the OpenCV framework to detect faces in the video. At first we started with a single face detector, but quickly realized that it might not be sufficient. Note the figure below: in the left image only 2 of the 4 faces were detected. This happens because our detector used frontal-face features, and in a sitcom setting actors rarely face the camera directly.
Later we added a second detector using profile features. Note the figure on the right: in a similar scene only 1 face was detected, but this time the frontal face detector (yellow boxes) failed entirely and only the profile detector (green boxes) succeeded. In the end, adding the second detector did improve the results and increased the number of faces we detect. A sketch of the two-detector setup is shown below.
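A minimal sketch of this two-detector setup, assuming the standard cascade files bundled with OpenCV; the `scaleFactor` and `minNeighbors` values below are common defaults, not the exact parameters we used:

```python
import cv2

# The two Haar cascades shipped with opencv-python.
frontal = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
profile = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_profileface.xml")

def detect_faces(frame):
    """Return (x, y, w, h) boxes from both the frontal and profile cascades."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    front = frontal.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    prof = profile.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return list(front) + list(prof)
```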
Face Recognition
The classifier was built using the MobileNet architecture [4] provided as part of the TensorFlow framework, and was then trained to recognize 7 characters from the show. The training dataset was built from a combination of Google Images results and shots from the show. For each cast member, the Google Images results were downloaded and run through the face detector, which extracted the faces as labeled data. Additionally, we manually labeled faces extracted from the video and added them to the Google Images dataset.
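A minimal transfer-learning sketch of such a classifier using tf.keras; the `faces/` directory layout, image size, and hyperparameters are illustrative assumptions, not our exact training setup:

```python
import tensorflow as tf

NUM_CLASSES = 7  # the seven characters from the show

# MobileNet [4] pre-trained on ImageNet, with the classification head removed.
base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the pre-trained features, train only the new head

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNet expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Labeled face crops, one subdirectory per character (assumed layout).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "faces/", image_size=(224, 224), batch_size=32)
model.fit(train_ds, epochs=10)
```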
In the figure below, the left image displays a successful identification of Howard in a 3/4 face pose. However, a few frames earlier the camera could only see Howard's profile, resulting in a classifier failure. We later realized these effects were caused by a lack of diversity in our dataset: most images returned by a Google search are of front-facing, smiling people, which does not help when trying to classify live actors.
Face Tracking
The proposed solution to face tracking in unconstrained video was the following:
1) Face detection using Haar feature-based cascade classifiers
2) Face recognition using a CNN trained on the 7 actors
3) Median Flow optical tracking until failure (failure is usually the result of a scene change or heavy occlusion)
4) After a tracking failure, return to step 1 and repeat
While we did implement Median Flow tracking in one of our early prototypes, it was not used to produce the video below. This was due to the shortcomings of face detection: without reliable face detection there is no data to track.
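For reference, here is a sketch of how a detected face box can be handed to OpenCV's Median Flow tracker until it fails. The video path and initial box are placeholders; the tracker lives under `cv2.legacy` in recent opencv-contrib-python builds (older 3.x builds expose `cv2.TrackerMedianFlow_create` directly):

```python
import cv2

cap = cv2.VideoCapture("episode.mp4")  # placeholder input video
ok, frame = cap.read()

tracker = cv2.legacy.TrackerMedianFlow_create()
bbox = (100, 80, 60, 60)  # illustrative (x, y, w, h) box from the Haar detector
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    ok, bbox = tracker.update(frame)
    if not ok:
        # Tracking failure (typically a scene change or heavy occlusion):
        # fall back to face detection, i.e. return to step 1.
        break
```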
Final implementation algorithm:
1) Parse a video frame
2) Run the frame through the frontal face detector
3) Run the frame through the profile face detector
4) Location-based detection arbitration, merging overlapping frontal and profile detections (see the sketch after this list)
5) Expand each face's bounding box
6) Run each face crop through the recognition CNN
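Steps 4 and 5 are sketched below; the IoU threshold and 20% padding factor are illustrative assumptions, not the exact values we used:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1 = a[0] + a[2], a[1] + a[3]
    bx1, by1 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax1, bx1) - max(a[0], b[0]))
    ih = max(0, min(ay1, by1) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def arbitrate(frontal_boxes, profile_boxes, thresh=0.3):
    """Step 4: keep all frontal boxes; add profile boxes overlapping none of them."""
    merged = list(frontal_boxes)
    for pb in profile_boxes:
        if all(iou(pb, fb) < thresh for fb in merged):
            merged.append(pb)
    return merged

def expand_box(box, frame_shape, pad=0.2):
    """Step 5: grow an (x, y, w, h) box by `pad` per side, clipped to the frame."""
    x, y, w, h = box
    fh, fw = frame_shape[:2]
    dx, dy = int(w * pad), int(h * pad)
    x0, y0 = max(0, x - dx), max(0, y - dy)
    return (x0, y0, min(fw, x + w + dx) - x0, min(fh, y + h + dy) - y0)
```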
Results
Qualitative
Quantitative - CNN Training
Validation accuracy = 79.0%
Test accuracy = 81.4%
The plot below displays the progression of validation accuracy (blue) and test accuracy (orange) over the course of 10,000 training iterations.
Conclusion
We have some valuable takeaways from our work on this project, the most important being our appreciation of the difficulties in tackling open problems in computer vision. From our experience with detection, recognition, and tracking algorithms, we have learned that detection is the weakest link of the three, especially in unconstrained sitcom videos, where the actors rarely, if ever, directly face the camera.
References
[1] Shun Zhang, Yihong Gong, Jia-Bin Huang, Jongwoo Lim, Jinjun Wang, Narendra Ahuja and Ming-Hsuan Yang, "Tracking Persons-of-Interest via Adaptive Discriminative Features", ECCV 2016
[2] V. Kılıç, M. Barnard, W. Wang and J. Kittler, "Audio Assisted Robust Visual Tracking With Adaptive Particle Filtering," in IEEE Transactions on Multimedia, vol. 17, no. 2, pp. 186-200, Feb. 2015.
[3] Eleonora D'Arca, Neil M. Robertson, and James R. Hopgood, "Robust indoor speaker recognition in a network of audio and video sensors," Signal Processing, vol. 129, pp. 137-149, 2016, https://doi.org/10.1016/j.sigpro.2016.04.014.
[4] Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).