Aayush J Rana, Yogesh S Rawat
NeurIPS 2022
Overview
In this work, we focus on reducing the annotation effort for video action detection. Existing work on label-efficient learning for action detection mostly focuses on semi-supervised or weakly-supervised approaches. These methods rely on separate (often external) actor detectors and tube-linking methods coupled with weakly-supervised multiple instance learning or pseudo-annotations, which limits their practical simplicity for general use. We argue that the lack of a selection criterion for annotating only informative data is one of the limitations of these methods. Motivated by this, we propose active sparse labeling (ASL), which bridges the gap between high performance and low annotation cost. ASL performs partial instance annotation (sparse labeling) via frame-level selection, where the goal is to annotate the most informative frames, which are expected to be the most useful for the action detection task.
We make the following contributions in this work:
We propose Active Sparse Labeling (ASL), a novel active learning (AL) strategy for action detection where each instance is partially annotated to reduce the labeling cost. To the best of our knowledge, this is the first work focused on AL for video action detection.
We propose Adaptive Proximity-aware Uncertainty (APU), a novel scoring mechanism for selecting a diverse set of informative frames from each video.
We also propose Max-Gaussian Weighted Loss (MGW-Loss), a novel training objective that helps the model learn effectively from sparse labels.
Adaptive Proximity-aware Uncertainty (APU)
Uncertainty as frame utility
Use MC-dropout to estimate the model's uncertainty at each pixel, then average over pixels to obtain a frame-level score
Adaptive proximity estimation
We use a normal distribution centered around each annotated frame to measure proximity
The overall APU score combines the frame uncertainty with the adaptive proximity term
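The two ingredients above can be sketched as follows. This is a minimal illustration, not the paper's exact formula: the multiplicative discount combining uncertainty and proximity, the `sigma` value, and the function names are all assumptions made for readability.

```python
import numpy as np

def frame_uncertainty(mc_preds):
    """Per-frame uncertainty from MC-dropout (illustrative).

    mc_preds: array of shape (passes, H, W) holding foreground
    probabilities for one frame over several stochastic forward
    passes with dropout kept active at inference.
    """
    # Pixel-wise variance across MC passes, averaged over all pixels.
    return mc_preds.var(axis=0).mean()

def proximity_score(frame_idx, annotated_idxs, sigma=3.0):
    """Gaussian closeness to the nearest annotated frame.

    Frames near an existing annotation score high (redundant),
    so their utility is down-weighted.
    """
    if len(annotated_idxs) == 0:
        return 0.0
    d = np.min(np.abs(np.asarray(annotated_idxs) - frame_idx))
    return np.exp(-d**2 / (2 * sigma**2))

def apu_score(mc_preds, frame_idx, annotated_idxs, sigma=3.0):
    """APU sketch: uncertainty discounted by proximity to labels."""
    u = frame_uncertainty(mc_preds)
    p = proximity_score(frame_idx, annotated_idxs, sigma)
    return u * (1.0 - p)
```

With this discount, a highly uncertain frame sitting right next to an annotated one still receives a low utility, which is what pushes the selection toward temporally diverse frames.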
Proposed approach
Estimate each frame’s utility using APU
APU adjusts for redundancy and diversity of frames
Informative frame selection
Select the highest-utility frame
Re-score the remaining frames using APU
Only the distance measure is re-computed (no model inference required)
Repeat until the annotation budget for the AL round is reached
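The selection loop above can be sketched as a greedy procedure. This is an illustrative sketch under assumed names and a Gaussian proximity term; the key property it shows is that after each pick, only the cheap distance term is updated while model uncertainties stay fixed for the whole AL round.

```python
import numpy as np

def _proximity(i, annotated, sigma):
    # Gaussian closeness to the nearest already-annotated frame (a sketch).
    if len(annotated) == 0:
        return 0.0
    d = np.min(np.abs(np.asarray(annotated) - i))
    return np.exp(-d**2 / (2 * sigma**2))

def select_frames(uncertainties, annotated_idxs, budget, sigma=3.0):
    """Greedy frame selection: pick the highest-utility frame, then
    re-score the rest using only the proximity term (no re-inference)."""
    annotated = list(annotated_idxs)
    candidates = [i for i in range(len(uncertainties)) if i not in annotated]
    selected = []
    for _ in range(budget):
        # Utility = uncertainty discounted by proximity to labeled frames.
        scores = [uncertainties[i] * (1.0 - _proximity(i, annotated, sigma))
                  for i in candidates]
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        annotated.append(best)   # only the distance term changes next round
        candidates.remove(best)
    return selected
```

On a video with uniform uncertainty, this loop spreads the picks apart, since each selection suppresses the utility of its temporal neighbors.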
Non-activity suppression
Avoid the influence of large background regions
Ignore highly certain background pixels when computing APU
Focus more on likely foreground (action) regions
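The suppression step can be sketched by masking out confidently-background pixels before averaging the per-pixel uncertainty; the threshold value and function name here are illustrative assumptions.

```python
import numpy as np

def frame_uncertainty_suppressed(mc_preds, bg_thresh=0.05):
    """Frame uncertainty with non-activity suppression (illustrative).

    Pixels whose mean foreground probability across MC-dropout passes
    is confidently background (below bg_thresh) are excluded, so large
    static backgrounds do not dilute the frame score.
    """
    mean_prob = mc_preds.mean(axis=0)   # (H, W) mean foreground probability
    var = mc_preds.var(axis=0)          # (H, W) per-pixel uncertainty
    mask = mean_prob >= bg_thresh       # keep likely-foreground pixels only
    if not mask.any():
        return 0.0
    return var[mask].mean()
```

Because the (near-zero-variance) background pixels are dropped from the average, frames with a small but uncertain action region no longer score lower than they should.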
Evaluation results
For UCF-101, we initialize with 1% of labeled frames and train the action detection model with a step size of 5% in each cycle. We achieve results very close to full annotation (v-mAP@0.5: 73.20 vs 75.12) using only 10% of annotated frames, a 90% reduction in annotation cost. For J-HMDB, we initialize with 3% labels, as it is a relatively small dataset and training an initial model with just 1% labels is challenging. Here, we obtain results comparable with 100% annotation using only 9% of labels. We also outperform prior weakly/semi-supervised methods, as ASL is able to learn with pseudo-labels in the sparse annotation setting while the AL cycle selects frames that are spatio-temporally useful for action detection.
Qualitative results
Analysis of frame selection using different methods. The x-axis represents all frames of the video, and each row represents a baseline method; the markers indicate the frames selected by that method. For both samples, our method selects distributed frames centered around the action region; Gal et al. [73] (G*) selects frames around the same region, since it has no distance measure, and Aghdam et al. [53] (A*) selects slightly more distributed frames, but not from the crucial action region. [G*: Gal et al. [73], A*: Aghdam et al. [53], Rand: Random, Equi: Equidistant]
We select fewer frames with higher diversity and utility
Performs better than G* [73] and A* [53] (prior methods), as well as random and equidistant selection
[53] Hamed H Aghdam, Abel Gonzalez-Garcia, Joost van de Weijer, and Antonio M López. Active learning for deep detection neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[73] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
Analysis
Frame selection methods
All methods use our MGW-loss to handle sparse labels
APU performs better at lower annotation cost
Handles temporal proximity in videos better than entropy- and uncertainty-based methods
Loss formulations
Masking does not utilize pseudo-labels and performs worse
Interpolation improves overall detection
MGW uses weight based on proximity to ground truth
Directs the network on how much to trust pseudo-labels based on their reliability
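The weighting idea behind MGW-Loss can be sketched as follows. This is a simplified illustration, assuming a Gaussian centered on each annotated frame and a max over those Gaussians; the actual loss in the paper operates on detection outputs, not a generic per-frame scalar.

```python
import numpy as np

def mgw_weights(num_frames, annotated_idxs, sigma=3.0):
    """Per-frame weights for a Max-Gaussian Weighted loss sketch.

    Annotated frames get weight 1; pseudo-labeled frames are weighted
    by the max over Gaussians centered on annotated frames, so the
    loss trusts pseudo-labels less the farther they are from a real label.
    """
    idxs = np.arange(num_frames)[:, None]        # (T, 1)
    ann = np.asarray(annotated_idxs)[None, :]    # (1, K)
    gauss = np.exp(-(idxs - ann) ** 2 / (2 * sigma**2))
    return gauss.max(axis=1)                     # (T,)

def mgw_loss(per_frame_loss, annotated_idxs, sigma=3.0):
    """Weighted mean of per-frame losses under the MGW weights."""
    w = mgw_weights(len(per_frame_loss), annotated_idxs, sigma)
    return float((w * np.asarray(per_frame_loss)).sum() / w.sum())
```

Compared to hard masking (weight 0 on all unlabeled frames) or plain interpolation (weight 1 everywhere), this soft schedule lets interpolated pseudo-labels contribute in proportion to how close, and hence how reliable, they are.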
Bibtex
@inproceedings{rana2022are,
title={Are all Frames Equal? Active Sparse Labeling for Video Action Detection},
author={Rana, Aayush J and Rawat, Yogesh S},
booktitle={Advances in Neural Information Processing Systems},
year={2022}
}
Team
Aayush J Rana
CRCV, UCF
Yogesh S Rawat
CRCV, UCF