Paper

Full paper with supplementary material

[PDF] 

Code

Setup instructions, code and models

[GitHub] 

Resources

Additional details, slides, posters and results.

[CRCV] [NeurIPS] 

Overview

In this work, we focus on reducing the annotation effort for video action detection. Existing work on label-efficient learning for action detection mostly focuses on semi-supervised or weakly-supervised approaches. These methods rely on separate (often external) actor detectors and tube-linking methods coupled with weakly-supervised multiple-instance learning or pseudo-annotations, which limits their practical simplicity for general use. We argue that the lack of a selection criterion for annotating only informative data is one of the limitations of these methods. Motivated by this, we propose active sparse labeling (ASL), which bridges the gap between high performance and low annotation cost. ASL performs partial instance annotation (sparse labeling) via frame-level selection, where the goal is to annotate the most informative frames, i.e., those expected to be most useful for the action detection task.
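Below is a minimal, self-contained sketch of the ASL cycle in Python (our illustration, not the released code); the detector training and the APU scoring are stubbed out, and the budget numbers follow the UCF-101 setup described later (1% initial labels, 5% more per cycle, roughly 10% total):

import numpy as np

# Stand-in numbers: 1,000 frames total across the dataset.
rng = np.random.default_rng(0)
n_frames = 1000
labeled = set(rng.choice(n_frames, size=n_frames // 100, replace=False).tolist())

def frame_scores():
    # Placeholder for the APU utility (see the APU section below);
    # random scores here just keep the loop runnable.
    return rng.random(n_frames)

while len(labeled) < 0.10 * n_frames:
    scores = frame_scores()
    scores[list(labeled)] = -np.inf          # never re-select annotated frames
    step = int(0.05 * n_frames)              # annotate 5% more frames per cycle
    new = np.argsort(scores)[-step:]         # highest-utility frames this cycle
    labeled |= {int(i) for i in new}
    # ... retrain the action detector on the enlarged sparse label set here ...

print(f"annotated {len(labeled)} of {n_frames} frames")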

We make the following contributions in this work: we propose active sparse labeling (ASL) for label-efficient video action detection, and we introduce an Adaptive Proximity-aware Uncertainty (APU) frame-scoring function, described next, for selecting informative frames.


Adaptive Proximity-aware Uncertainty (APU)

Uncertainty as frame utility

We use MC-dropout to estimate the model's uncertainty at each pixel and average these per-pixel values to obtain a frame-level score.
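A sketch of how this could look in PyTorch (our reading of the description above, not the authors' released code; the model is assumed to output per-pixel action maps of shape (B, C, T, H, W)):

import torch

def enable_dropout(model):
    # MC-dropout: keep only the Dropout layers stochastic at test time,
    # leaving BatchNorm etc. in eval mode.
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

@torch.no_grad()
def mc_dropout_frame_scores(model, clip, passes=10):
    enable_dropout(model)
    preds = torch.stack([torch.sigmoid(model(clip)) for _ in range(passes)])
    pixel_uncertainty = preds.var(dim=0)           # per-pixel variance over the passes
    return pixel_uncertainty.mean(dim=(1, 3, 4))   # average over C, H, W -> one score per frame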

Adaptive proximity estimation

We use a normal distribution centered around each annotated frame to measure how close a candidate frame lies to existing annotations.
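A rough version of this proximity term (sigma is a hypothetical spread parameter, not given in the text above):

import numpy as np

def proximity(t, annotated_frames, sigma=5.0):
    # Gaussian centered on each annotated frame; a candidate frame's
    # proximity is its strongest response to any existing annotation.
    annotated = np.asarray(annotated_frames, dtype=float)
    return float(np.exp(-((t - annotated) ** 2) / (2 * sigma ** 2)).max())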



Overall, APU is computed by combining the uncertainty and proximity terms into a single frame score.
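The exact equation is given in the paper; one consistent reconstruction from the two components above, with $u_t$ the frame-averaged MC-dropout uncertainty, $A$ the current set of annotated frames, and $\sigma$ the spread of the proximity Gaussian, would be

$$s_t = u_t \cdot \Big(1 - \max_{a \in A} \exp\big(-\tfrac{(t-a)^2}{2\sigma^2}\big)\Big),$$

so that uncertain frames far from existing annotations score highest.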

Proposed approach


Evaluation results

For UCF-101, we initialize with 1% of labeled frames and train the action detection model, adding 5% more labels in each active learning cycle. We achieve results very close to full annotation (v-mAP@0.5: 73.20 vs. 75.12) using only 10% of annotated frames, a 90% reduction in annotation cost. For J-HMDB, we initialize with 3% of labels, since it is a relatively small dataset and it is challenging to train an initial model with just 1% of labels; here, we obtain results comparable to 100% annotation with only 9% of labels. We also outperform prior weakly- and semi-supervised methods: ASL learns with pseudo-labels in the sparse annotation setting, while the active learning cycle selects frames that are spatio-temporally useful for action detection.

Qualitative results

Analysis of frame selection using different methods. The x-axis represents all frames of the video, and each row represents a baseline method; the markers in each row mark the frames selected by that method. For both samples, our method selects well-distributed frames centered around the action region; Gal et al. [73] (G*) selects frames clustered around the same region since it has no distance measure, and Aghdam et al. [53] (A*) selects slightly more distributed frames, but they do not come from the crucial action region. [G*: Gal et al. [73], A*: Aghdam et al. [53], Rand: Random, Equi: Equidistant]
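For reference, the two simple baselines from the legend are straightforward to write down (our illustrative versions):

import numpy as np

def random_selection(n_frames, k, seed=0):
    # Rand: sample k frames uniformly without replacement
    return np.random.default_rng(seed).choice(n_frames, size=k, replace=False)

def equidistant_selection(n_frames, k):
    # Equi: k evenly spaced frames across the video
    return np.linspace(0, n_frames - 1, num=k, dtype=int)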

[53] Hamed H. Aghdam, Abel Gonzalez-Garcia, Joost van de Weijer, and Antonio M. López. Active learning for deep detection neural networks. In ICCV, 2019.

[73] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.

Analysis

Frame selection methods


Loss formulations


Bibtex

@inproceedings{rana2022are,
  title={Are all Frames Equal? Active Sparse Labeling for Video Action Detection},
  author={Rana, Aayush J and Rawat, Yogesh S},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}


Team

Aayush J Rana

CRCV, UCF 

Yogesh S Rawat

CRCV, UCF