Discriminative Segment Annotation in Weakly Labeled Video

The ubiquitous availability of Internet video offers the vision community the exciting opportunity to directly learn localized visual concepts from real-world imagery. Unfortunately, most such attempts are doomed because traditional approaches are ill-suited, both in terms of their computational characteristics and their inability to robustly contend with the label noise that plagues uncurated Internet content. We present CRANE, a weakly supervised algorithm that is specifically designed to learn under such conditions. First, we exploit the asymmetric availability of real-world training data, where small numbers of positive videos tagged with the concept are supplemented with large quantities of unreliable negative data. Second, we ensure that CRANE is robust to label noise, both in terms of tagged videos that fail to contain the concept as well as occasional negative videos that do. Finally, CRANE is highly parallelizable, making it practical to deploy at large scale without sacrificing the quality of the learned solution. Although CRANE is general, this paper focuses on segment annotation, where we show state-of-the-art pixel-level segmentation results on two datasets, one of which includes a training set of spatiotemporal segments from more than 20,000 videos.

Our Problem:
The problem we seek to address is semantic object segmentation in weakly labeled video. Given a video weakly tagged with the "dog" concept [top], we first perform unsupervised spatiotemporal segmentation [middle]. Our method then identifies the segments that correspond to the label, generating a semantic segmentation [bottom]. These are actual segmentations produced by our system.

Our Solution:

As input, our algorithm is given sets of positively and negatively tagged videos, corresponding to weak labels. We start by processing each video with a standard unsupervised spatiotemporal segmentation method that aims to preserve object boundaries. Then, with these segments as input, our algorithm CRANE ranks the segments in the positive videos by their probability of belonging to the given concept. We evaluate our method in two scenarios: transductive segment annotation (TSA) and inductive segment annotation (ISA). In TSA, we directly evaluate the precision-recall of the ranked segments from the positive training videos. In ISA, we train a segment classifier on the ranked segments returned by CRANE and evaluate precision-recall on a separate, disjoint test set of videos. TSA corresponds to the scenario where we would like to automatically obtain supervised training data, whereas ISA corresponds to the scenario where we would like to use that training data to perform semantic object segmentation on novel videos.
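The ranking step above can be sketched as follows. This is a minimal illustration of the intuition, not the paper's exact formulation: segments are assumed to be fixed-length feature vectors, and the penalty function is a placeholder. The idea is that segments from negative videos almost never depict the concept, so each negative segment penalizes the positive-video segment it most resembles; segments that survive with few penalties rank highest. Because each negative segment is processed independently, this loop parallelizes trivially, which is what makes the approach practical at large scale.

```python
import numpy as np

def crane_rank(pos_segments, neg_segments):
    """Rank positive-video segments by a CRANE-style score (sketch).

    pos_segments: (P, d) array of segment feature vectors from
                  positively tagged videos.
    neg_segments: (N, d) array of segment feature vectors from
                  negatively tagged videos.
    Returns indices into pos_segments, most concept-like first.
    """
    scores = np.zeros(len(pos_segments))
    for n in neg_segments:
        # Distance from this negative segment to every positive segment.
        d = np.linalg.norm(pos_segments - n, axis=1)
        nearest = int(np.argmin(d))
        # Penalize only the single nearest positive segment, with a
        # penalty that decays with distance (an assumed choice here).
        # Concentrating each negative's influence on one neighbor keeps
        # the ranking robust to the occasional mislabeled negative.
        scores[nearest] -= np.exp(-d[nearest])
    return np.argsort(-scores)
```

With a toy example, positive segments that sit far from all negative data keep a score of zero and rank ahead of look-alikes of the negatives, which is exactly the behavior the transductive (TSA) evaluation measures.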

Sample Results:

Discriminative Segment Annotation in Weakly Labeled Video
Kevin Tang, Rahul Sukthankar, Jay Yagnik, Li Fei-Fei
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

Annotations for the YouTube-Objects dataset are available here.
The original dataset can be found here.