ECCV16 Face Tracking

Tracking Persons-of-Interest via Adaptive Discriminative Features

ECCV 2016

Shun Zhang1, Yihong Gong1, Jia-Bin Huang2, Jongwoo Lim3, Jinjun Wang1, Narendra Ahuja2 and Ming-Hsuan Yang4

1Xi'an Jiaotong University, 2University of Illinois, Urbana-Champaign, 3Hanyang University, 4University of California, Merced

Abstract

Multi-face tracking in unconstrained videos is a challenging problem as faces of one person often appear drastically different in multiple shots due to significant variations in scale, pose, expression, illumination, and make-up. Low-level features used in existing multi-target tracking methods are not effective for identifying faces with such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face features using convolutional neural networks (CNNs). Unlike existing CNN-based approaches that are only trained on large-scale face image datasets offline, we further adapt the pre-trained face CNN to specific videos using automatically discovered training samples from tracklets. Our network directly optimizes the embedding space so that the Euclidean distances correspond to a measure of semantic face similarity. This is technically realized by minimizing an improved triplet loss function. With the learned discriminative features, we apply the Hungarian algorithm to link tracklets within each shot and the hierarchical clustering algorithm to link tracklets across multiple shots to form final trajectories. We extensively evaluate the proposed algorithm on a set of TV sitcoms and music videos and demonstrate significant performance improvement over existing techniques.

Multi-face Tracking

We focus on tracking multiple faces according to their unknown identities in unconstrained videos, which consist of many shots from different cameras. The main challenge is to handle large face appearance variations across shots caused by changes in pose, viewing angle, scale, make-up, illumination, and camera motion, as well as heavy occlusions.

Overview

We illustrate the four main steps of our algorithm in Figure 2 (illustrative code sketches for each step follow the list):

(a) Pre-training: We pre-train a CNN model based on the AlexNet architecture [37] using an external large-scale face recognition dataset to learn identity-preserving features (Section 6.1).

(b) Automatic training sample discovery: We detect shot changes and divide the video into non-overlapping shots. Within each shot, we apply an offline-trained face detector and link adjacent detections into short tracklets. We discover a large collection of face pairs or face triplets from tracklets based on spatio-temporal constraints (Section 4.1).

(c) Adaptive feature learning: We adapt the pre-trained CNN using the automatically discovered training samples to address the large appearance changes of faces present in a specific video (Section 4.2). For adapting the CNN, we first introduce two types of loss functions for optimizing the embedding space: the contrastive loss and the triplet loss. We then present a new triplet loss to improve the discriminative ability of the learned features (Section 4.3).

(d) Linking tracklets: Within each shot, we use conventional multi-face tracking methods to link tracklets into short trajectories. We then use a hierarchical clustering algorithm to link trajectories across shots and assign the tracklets in each cluster the same identity (Section 5).
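
A minimal sketch of step (a), assuming a PyTorch setup: an AlexNet-style network is trained for face identification with a standard classification loss to obtain identity-preserving features. The number of identities, optimizer settings, and training loop below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of step (a): supervised pre-training of an AlexNet-style
# model on an external face recognition dataset. All hyperparameters here
# are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import alexnet

NUM_IDENTITIES = 10_000  # assumption: number of identities in the dataset

model = alexnet(weights=None)
# Swap the last classifier layer to predict face identities instead of the
# default 1000 ImageNet classes.
model.classifier[6] = nn.Linear(4096, NUM_IDENTITIES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def pretrain_step(images, labels):
    """One training step on a batch of (face crop, identity label) pairs."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```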
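Step (b)'s spatio-temporal constraints can be sketched as follows: faces within one tracklet share an identity (positive pairs), while two tracklets that appear in the same frame must belong to different people (negative pairs and triplets). The tracklet data structure below is a hypothetical simplification.

```python
# Sketch of step (b): mine face pairs and triplets from tracklets via
# spatio-temporal constraints. Each tracklet is assumed to be a dict with
# 'frames' (a set of frame ids) and 'faces' (the corresponding face crops).
from itertools import combinations

def mine_pairs(tracklets):
    positives, negatives = [], []
    for t in tracklets:
        # Faces within one tracklet depict the same person.
        positives.extend(combinations(t['faces'], 2))
    for a, b in combinations(tracklets, 2):
        # Tracklets that co-occur in a frame cannot share an identity.
        if a['frames'] & b['frames']:
            negatives.extend((fa, fb) for fa in a['faces'] for fb in b['faces'])
    return positives, negatives

def mine_triplets(tracklets):
    """(anchor, positive, negative) triplets from the same constraints."""
    triplets = []
    for a, b in combinations(tracklets, 2):
        if a['frames'] & b['frames']:
            for anchor, pos in combinations(a['faces'], 2):
                for neg in b['faces']:
                    triplets.append((anchor, pos, neg))
    return triplets
```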
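For step (c), a sketch of the standard triplet loss is shown below; it pulls anchor-positive distances below anchor-negative distances by a margin. The paper's improved triplet loss (Section 4.3) refines this formulation and is not reproduced here; the margin value is an arbitrary placeholder.

```python
# Sketch of the standard triplet loss used as a starting point in step (c).
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """anchor/positive/negative: (batch, dim) embedding tensors."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)  # squared Euclidean distance
    d_an = (anchor - negative).pow(2).sum(dim=1)
    # Hinge: penalize triplets where the negative is not at least
    # `alpha` farther from the anchor than the positive.
    return F.relu(d_ap - d_an + alpha).mean()
```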
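Step (d) can be sketched with off-the-shelf SciPy routines: the Hungarian algorithm (linear_sum_assignment) links tracklets within a shot, and agglomerative clustering groups trajectories across shots. How the cost matrix is built and the distance threshold are assumptions left open for illustration.

```python
# Sketch of step (d): within-shot linking via the Hungarian algorithm and
# across-shot linking via hierarchical clustering of trajectory features.
from scipy.optimize import linear_sum_assignment
from scipy.cluster.hierarchy import linkage, fcluster

def link_within_shot(cost):
    """cost[i, j]: cost of linking the end of tracklet i to the start of
    tracklet j (appearance/motion terms; construction not shown here)."""
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))

def link_across_shots(features, threshold=1.0):
    """features: (num_trajectories, dim) mean CNN feature per trajectory.
    Returns one identity label per trajectory."""
    Z = linkage(features, method='average', metric='euclidean')
    return fcluster(Z, t=threshold, criterion='distance')
```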

2D t-SNE Visualization

2D t-SNE visualization of all face features from the proposed fine-tuned CNN, which adapts to video-specific variations, compared with HOG, AlexNet, and pre-trained features. The T-ara video has six main cast members; the faces of different people are color-coded.
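
A plot like the one above can be reproduced with scikit-learn's t-SNE; the sketch below assumes face features and integer identity labels have already been extracted, and the perplexity setting is an arbitrary default.

```python
# Sketch: project face features to 2D with t-SNE and color-code by identity.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, identities):
    """features: (num_faces, dim) array; identities: per-face integer labels."""
    xy = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(features)
    plt.scatter(xy[:, 0], xy[:, 1], c=identities, s=4, cmap='tab10')
    plt.title('2D t-SNE of face features')
    plt.show()
```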


Music video dataset

We contribute a new dataset consisting of 8 music videos from YouTube: T-ara, Westlife, Pussycat Dolls, Apink, Darling, Bruno Mars, Hello Bubble and Girls Aloud. Three of the sequences (T-ara, Westlife and Pussycat Dolls) are live vocal concert recordings captured by multiple cameras with different views. The other sequences (Bruno Mars, Apink, Hello Bubble, Darling and Girls Aloud) are MTV videos. The dataset presents a set of challenges (e.g., frequent shot/scene changes, large appearance variations, and rapid camera motion) that are crucial for developing multi-face tracking algorithms in unconstrained environments.

We provide full annotations of 3,845 face tracklets and 117,598 face detections. The videos with ground-truth annotations can be downloaded from BaiduYun.

Tracking results

Paper

[paper] [poster]

Code and models

[code] [data: Dropbox or BaiduYun]

Citation

@inproceedings{Zhang-ECCV-2016,
  author       = {Zhang, Shun and Gong, Yihong and Huang, Jia-Bin and Lim, Jongwoo and Wang, Jinjun and Ahuja, Narendra and Yang, Ming-Hsuan},
  title        = {Tracking Persons-of-Interest via Adaptive Discriminative Features},
  booktitle    = {European Conference on Computer Vision},
  pages        = {415--433},
  year         = {2016},
  organization = {Springer}
}