Paper

Full paper with supplementary material

[PDF] 

Code

Setup instructions, code and models

[GitHub] 

Resources

Additional details, slides, posters and results.

[CRCV] [NeurIPS] 

Overview

In this work, we focus on reducing the annotation effort for video action detection. Existing work on label-efficient learning for action detection mostly focuses on semi-supervised or weakly-supervised approaches. These methods rely on separate (often external) actor detectors and tube-linking methods coupled with weakly-supervised multiple-instance learning or pseudo-annotations, which limits their practical simplicity for general use. We argue that the lack of a selection criterion for annotating only informative data is one of the limitations of these methods. Motivated by this, we propose active sparse labeling (ASL), which bridges the gap between high performance and low annotation cost. ASL performs partial instance annotation (sparse labeling) via frame-level selection, where the goal is to annotate the most informative frames, i.e., those expected to be most useful for the action detection task.
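Below is a minimal, self-contained sketch of the ASL cycle in Python (our illustration, not the released code); the detector training and the APU scoring are stubbed out, and the budget numbers follow the UCF-101 setup described later (1% initial labels, 5% more per cycle, roughly 10% total):

import numpy as np

# Stand-in numbers: 1,000 frames total across the dataset.
rng = np.random.default_rng(0)
n_frames = 1000
labeled = set(rng.choice(n_frames, size=n_frames // 100, replace=False).tolist())

def frame_scores():
    # Placeholder for the APU utility (see the APU section below);
    # random scores here just keep the loop runnable.
    return rng.random(n_frames)

while len(labeled) < 0.10 * n_frames:
    scores = frame_scores()
    scores[list(labeled)] = -np.inf          # never re-select annotated frames
    step = int(0.05 * n_frames)              # annotate 5% more frames per cycle
    new = np.argsort(scores)[-step:]         # highest-utility frames this cycle
    labeled |= {int(i) for i in new}
    # ... retrain the action detector on the enlarged sparse label set here ...

print(f"annotated {len(labeled)} of {n_frames} frames")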

We make the following contributions in this work: we propose active sparse labeling (ASL) for label-efficient video action detection, and we introduce an Adaptive Proximity-aware Uncertainty (APU) frame-scoring function, described next, for selecting informative frames.


Adaptive Proximity-aware Uncertainty (APU)

Uncertainty as frame utility

We use MC-dropout to estimate the model's uncertainty at each pixel and average these per-pixel values to obtain a frame-level score.
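A sketch of how this could look in PyTorch (our reading of the description above, not the authors' released code; the model is assumed to output per-pixel action maps of shape (B, C, T, H, W)):

import torch

def enable_dropout(model):
    # MC-dropout: keep only the Dropout layers stochastic at test time,
    # leaving BatchNorm etc. in eval mode.
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

@torch.no_grad()
def mc_dropout_frame_scores(model, clip, passes=10):
    enable_dropout(model)
    preds = torch.stack([torch.sigmoid(model(clip)) for _ in range(passes)])
    pixel_uncertainty = preds.var(dim=0)           # per-pixel variance over the passes
    return pixel_uncertainty.mean(dim=(1, 3, 4))   # average over C, H, W -> one score per frame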

Adaptive proximity estimation

We use a normal distribution centered around each annotated frame to measure how close a candidate frame lies to existing annotations.
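A rough version of this proximity term (sigma is a hypothetical spread parameter, not given in the text above):

import numpy as np

def proximity(t, annotated_frames, sigma=5.0):
    # Gaussian centered on each annotated frame; a candidate frame's
    # proximity is its strongest response to any existing annotation.
    annotated = np.asarray(annotated_frames, dtype=float)
    return float(np.exp(-((t - annotated) ** 2) / (2 * sigma ** 2)).max())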



Overall, APU is computed by combining the uncertainty and proximity terms into a single frame score.
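The exact equation is given in the paper; one consistent reconstruction from the two components above, with $u_t$ the frame-averaged MC-dropout uncertainty, $A$ the current set of annotated frames, and $\sigma$ the spread of the proximity Gaussian, would be

$$s_t = u_t \cdot \Big(1 - \max_{a \in A} \exp\big(-\tfrac{(t-a)^2}{2\sigma^2}\big)\Big),$$

so that uncertain frames far from existing annotations score highest.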

Proposed approach


Evaluation results

For UCF-101, we initialize with 1% of labeled frames and train the action detection model, adding 5% more labels in each active learning cycle. We achieve results very close to full annotation (v-mAP@0.5: 73.20 vs. 75.12) using only 10% of annotated frames, a 90% reduction in annotation cost. For J-HMDB, we initialize with 3% of labels, since it is a relatively small dataset and it is challenging to train an initial model with just 1% of labels; here, we obtain results comparable to 100% annotation with only 9% of labels. We also outperform prior weakly- and semi-supervised methods: ASL learns with pseudo-labels in the sparse annotation setting, while the active learning cycle selects frames that are spatio-temporally useful for action detection.

Qualitative results

Analysis of frame selection using different methods. The x-axis represents all frames of the video, and each row represents a baseline method; the markers in each row mark the frames selected by that method. For both samples, our method selects well-distributed frames centered around the action region; Gal et al. [73] (G*) selects frames clustered around the same region since it has no distance measure, and Aghdam et al. [53] (A*) selects slightly more distributed frames, but they do not come from the crucial action region. [G*: Gal et al. [73], A*: Aghdam et al. [53], Rand: Random, Equi: Equidistant]
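For reference, the two simple baselines from the legend are straightforward to write down (our illustrative versions):

import numpy as np

def random_selection(n_frames, k, seed=0):
    # Rand: sample k frames uniformly without replacement
    return np.random.default_rng(seed).choice(n_frames, size=k, replace=False)

def equidistant_selection(n_frames, k):
    # Equi: k evenly spaced frames across the video
    return np.linspace(0, n_frames - 1, num=k, dtype=int)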

[53] Hamed H. Aghdam, Abel Gonzalez-Garcia, Joost van de Weijer, and Antonio M. López. Active learning for deep detection neural networks. In ICCV, 2019.

[73] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.

Analysis

Frame selection methods


Loss formulations


Bibtex

@inproceedings{rana2022are,
  title={Are all Frames Equal? Active Sparse Labeling for Video Action Detection},
  author={Rana, Aayush J and Rawat, Yogesh S},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}


Team

Aayush J Rana

CRCV, UCF 

Yogesh S Rawat

CRCV, UCF