Full paper 



Setup instructions, code and models



Additional details, slides, posters and results.

[CRCV] [CVPR 2023]



In this work, we study how the high annotation cost of spatio-temporal detection can be reduced with minimal performance trade-off. Existing work on label-efficient learning uses weakly-supervised or semi-supervised methods to save annotation cost. Such approaches have been effective for classification tasks; however, spatio-temporal detection is more challenging under limited annotations, and these methods perform worse than fully supervised ones. One of their main limitations is the lack of a selection criterion that can guide the labeling of only informative samples. To overcome this limitation, we investigate the use of active learning for label-efficient video action detection. Traditional active learning (AL) approaches typically focus on classification tasks, where selection is performed at the sample level. We explore a hybrid active learning strategy that performs both intra-sample and inter-sample selection. Intra-sample selection targets informative frames within a video, and inter-sample selection targets informative samples at the video level, as shown in Figure 1. This hybrid approach results in efficient labeling by significantly reducing annotation cost.

We make the following contributions in this work:

1) A novel hybrid AL strategy that selects frames and videos based on informativeness and diversity.

2) A clustering-based selection criterion that enables diversity in sample selection.

3) A novel training objective that makes effective use of limited labels through temporal continuity.


We evaluate the proposed approach on UCF-101-24 and J-HMDB-21 and demonstrate that it outperforms other AL baselines and achieves performance comparable to a model trained on 90% of the annotations at a fraction (5% vs. 90%) of the annotation cost.

Figure 1: Overview of different active learning strategies for sample selection. We show a toy example of each selection strategy as we add more annotations to set 1 to obtain set 2. The sample selection approach takes an unlabeled sample and annotates all frames in it. Intra-sample selection picks frames from all samples to annotate for the next set. Hybrid selection picks important samples and high-utility frames to annotate for the next set, significantly reducing overall annotation cost.

Proposed approach

Overview of the proposed approach. Model training uses videos with partial labels to learn action detection with the STeW loss and a classification loss, while also learning cluster assignment via a cluster loss. The CLAUS hybrid active learning uses a trained model’s output for intra-sample selection and the cluster assignment Cv for a video. Intra-sample selection uses the model score and selects the top A_t frames of a video to obtain the video score (V_score). V_score and Cv are used for inter-sample selection, and the selected samples are sent to an oracle for annotation. UV: unlabeled videos.
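The intra-sample step described above can be sketched as follows. This is a minimal illustration and not the released implementation: the function name `intra_sample_select`, the convention that a higher frame score means a more informative frame, and the use of the mean of the selected scores as V_score are all assumptions made for this example.

```python
from typing import List, Tuple


def intra_sample_select(frame_scores: List[float], a_t: int) -> Tuple[List[int], float]:
    """Pick the a_t highest-scoring frames of one video.

    Returns the selected frame indices (in temporal order) and a video score.
    Averaging the selected scores is an assumption made for this sketch.
    """
    if a_t <= 0 or not frame_scores:
        return [], 0.0
    # Rank frame indices by score, highest first; ties keep temporal order.
    ranked = sorted(range(len(frame_scores)), key=lambda i: -frame_scores[i])
    chosen = sorted(ranked[:a_t])
    v_score = sum(frame_scores[i] for i in chosen) / len(chosen)
    return chosen, v_score


# Toy video with five frames; select the two most informative ones.
frames, v_score = intra_sample_select([0.1, 0.9, 0.4, 0.8, 0.2], a_t=2)
print(frames, round(v_score, 2))
```

In this toy case, frames 1 and 3 are selected and their scores are averaged into a single value that can then be used to rank the video against others.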


We evaluate our approach on UCF-101-24 and J-HMDB-21 video action detection datasets. We measure the standard frame-mAP and video-mAP scores for different thresholds to evaluate our model’s action detection results following prior works. The frame-mAP reflects the average precision of detection at the frame level for each class, which is then averaged to obtain the f-mAP. The video-mAP reflects the average precision at the video level, which is averaged to obtain the v-mAP score.
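The per-class averaging behind f-mAP and v-mAP can be sketched as below. This is a simplified illustration under stated assumptions: detections are assumed to be already matched to ground truth at a chosen IoU threshold (the `is_true_positive` flag), and AP is computed as the non-interpolated mean of the precision at each true positive. Benchmark evaluation code may use an interpolated variant, so treat the helper names and the exact AP formula as hypothetical.

```python
from typing import Dict, List, Tuple

Detection = Tuple[float, bool]  # (confidence, is_true_positive)


def average_precision(detections: List[Detection], num_gt: int) -> float:
    """Non-interpolated AP for one class."""
    if num_gt == 0:
        return 0.0
    # Rank detections by confidence, highest first.
    ranked = sorted(detections, key=lambda d: -d[0])
    true_pos = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_pos += 1
            precision_sum += true_pos / rank  # precision at this recall point
    return precision_sum / num_gt


def mean_average_precision(per_class: Dict[str, List[Detection]],
                           gt_counts: Dict[str, int]) -> float:
    """Average the per-class AP values to obtain mAP."""
    aps = [average_precision(dets, gt_counts[c]) for c, dets in per_class.items()]
    return sum(aps) / len(aps) if aps else 0.0


dets = {"run": [(0.9, True), (0.8, False), (0.7, True)], "jump": [(0.6, True)]}
print(mean_average_precision(dets, {"run": 2, "jump": 1}))
```

The same averaging applies to both metrics; they differ in whether a detection is a box in a single frame (f-mAP) or a tube over a whole video (v-mAP).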

Table 1: Comparison with state-of-the-art weakly-supervised methods on UCF-101-24. We evaluate our approach on v-mAP and f-mAP scores using only 1% and 5% total frame annotations. ‘V’ uses video-level annotations and ‘P’ uses a fraction of the mixed annotation. ‘S’ denotes SSL methods. We report [64] with their scores for 2 (1.1%) and 5 (2.8%) frames annotated per video.

Table 2: Comparison with state-of-the-art semi-supervised methods on J-HMDB-21 using only 1% and 5% total frames annotation. ‘V’ uses video-level class annotations. ‘S’ denotes SSL method. We report [64] with their scores for 2 (6%) and 5 (15%) frames annotated per video.

We show the evaluation on UCF-101-24 and J-HMDB-21 in Tables 1 and 2, respectively. For UCF-101-24, we start at 0.25% total annotations and increase this iteratively to 5% using our CLAUS selection method. For J-HMDB-21, we start at 0.15% and increase to 5% using CLAUS. Our cluster-based video and frame selection approach selects a limited number of samples and can therefore also be compared with prior weakly supervised methods for video action detection. Prior weakly supervised methods rely on an off-the-shelf actor detector or user-generated points to create ground-truth (GT) annotations for training. They depend on multiple external components or require the user to annotate points in each frame, which reduces their practical use. Our approach doesn’t rely on external detection components and uses a simple iterative approach to select a limited set of useful samples, which makes it easy to use for training.


Selection strategy

We analyze the effect of the proposed CLAUS selection method by comparing it with existing selection methods in Figure 3. We compare our method with random, equidistant, entropy-based [1] and uncertainty-based [14] AL baselines on UCF-101-24 and J-HMDB-21. Random and equidistant represent non-parametric sample selection, where the videos are selected at random and the frames are selected at random or at equal intervals. These baselines give the lowest scores. We then compare with the other AL baselines [1, 14]. Since these are image-based, they are not well suited for frame ranking in videos, as reflected by their scores. [1] ignores the nearest 5 frames for each selection, but this still does not work as well as the proposed diverse selection. Since these prior AL baselines have no notion of similarity or distance between videos, random selection performs comparably to them. In contrast, our approach gives the best performance, highlighting the impact of cluster-based diverse sample selection.
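To make the frame-level baselines concrete, here is a rough sketch of equidistant selection and of an entropy-ranked selection that skips the 5 frames nearest to each pick. It is only an approximation of the compared methods: the function names are hypothetical, and scoring each frame by the binary entropy of a single per-frame probability is an assumption for this example rather than the formulation used in [1].

```python
import math
from typing import List


def equidistant_frames(num_frames: int, k: int) -> List[int]:
    """Pick k frame indices spread evenly over a video."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    # Take the frame at the centre of each of the k equal-length segments.
    return [int(step * i + step / 2) for i in range(k)]


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def entropy_frames(frame_probs: List[float], k: int, window: int = 5) -> List[int]:
    """Greedy highest-entropy frames, skipping frames within `window` of a pick."""
    order = sorted(range(len(frame_probs)),
                   key=lambda i: -binary_entropy(frame_probs[i]))
    picked: List[int] = []
    for i in order:
        if len(picked) == k:
            break
        # Temporal suppression: neighbouring frames are usually near-duplicates.
        if all(abs(i - j) > window for j in picked):
            picked.append(i)
    return sorted(picked)


probs = [0.5, 0.5, 0.9, 0.9, 0.9, 0.9, 0.9, 0.6, 0.9, 0.9, 0.9, 0.9]
print(equidistant_frames(len(probs), 2), entropy_frames(probs, 2))
```

Neither function looks at how similar two videos are, which is the gap that cluster-based selection is meant to close.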

Figure 3: Evaluating various scoring methods for AL annotation increments. * uses our STeW loss for all selection approaches on UCF-101-24 (a-b) and J-HMDB-21 (c-d).

Effect of clustering

We evaluate the effect of clustering for video selection in our approach in Figure 4. The selection approach without clustering simply selects the top-k videos for further annotation, which ends up selecting some similar samples because it does not take diversity into account. Clustering increases sample diversity, which improves overall performance.
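The difference between the two video selection variants can be illustrated with a small sketch. The round-robin pass over clusters used here is an assumption chosen for simplicity, not necessarily how CLAUS combines V_score and Cv; the function names and toy scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def top_k_videos(v_scores: Dict[str, float], k: int) -> List[str]:
    """Baseline without clustering: take the k highest-scoring videos."""
    return sorted(v_scores, key=lambda v: -v_scores[v])[:k]


def cluster_aware_videos(v_scores: Dict[str, float],
                         clusters: Dict[str, int], k: int) -> List[str]:
    """Round-robin over clusters, taking the best remaining video of each."""
    by_cluster: Dict[int, List[str]] = defaultdict(list)
    # Visiting videos by descending score keeps each cluster queue sorted and
    # orders the clusters by the score of their best video.
    for video in sorted(v_scores, key=lambda v: -v_scores[v]):
        by_cluster[clusters[video]].append(video)
    queues = list(by_cluster.values())
    picked: List[str] = []
    while len(picked) < k and any(queues):
        for queue in queues:
            if queue and len(picked) < k:
                picked.append(queue.pop(0))
    return picked


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.3}
assignment = {"a": 0, "b": 0, "c": 0, "d": 1}
print(top_k_videos(scores, 2), cluster_aware_videos(scores, assignment, 2))
```

With a budget of two videos, plain top-k picks "a" and "b" from the same cluster, while the cluster-aware variant picks "a" and "d", trading some score for coverage of a second cluster.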

Effect of STeW loss

To evaluate the effect of our proposed STeW loss, we also train the action detection network with a simple frame loss and an interpolation loss on the UCF-101-24 dataset. The frame loss computes loss only for annotated frames and ignores the pseudo-labels, while the interpolation loss computes loss for all real and pseudo-labels equally. We use the same AL algorithm for all approaches and show the results on UCF-101-24 at different steps in Figure 5.
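The two baseline losses differ only in how they weight frames, which the sketch below makes explicit. It assumes per-frame loss values have already been computed and that each variant is a weighted mean over frames; the function names and the normalization are assumptions for illustration. The STeW loss itself is not reproduced here, since its formulation is not given on this page.

```python
from typing import List


def weighted_loss(frame_losses: List[float], weights: List[float]) -> float:
    """Weighted mean of per-frame losses; frames with weight 0 are ignored."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(l * w for l, w in zip(frame_losses, weights)) / total


def frame_loss(frame_losses: List[float], is_annotated: List[bool]) -> float:
    """Only annotated frames contribute; pseudo-labeled frames get weight 0."""
    return weighted_loss(frame_losses, [1.0 if a else 0.0 for a in is_annotated])


def interpolation_loss(frame_losses: List[float], is_annotated: List[bool]) -> float:
    """Real and pseudo-labeled frames contribute equally."""
    return weighted_loss(frame_losses, [1.0] * len(frame_losses))


# Four frames: the first and last are annotated, the middle two are pseudo-labeled.
losses = [0.2, 0.8, 0.6, 0.4]
annotated = [True, False, False, True]
print(round(frame_loss(losses, annotated), 3),
      round(interpolation_loss(losses, annotated), 3))
```

Seen this way, any objective that trusts pseudo-labels to a different degree than real labels corresponds to a weight vector between these two extremes.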

Figure 4: Comparison of our approach with and without clustering-based selection on UCF-101-24 (a) and J-HMDB-21 (b).

Figure 5: Comparison of the proposed STeW loss with different loss variations, combined with our CLAUS selection, for training the video action detection network on the UCF-101-24 dataset.

Cost analysis

Figure 6 (a-b) compares the cost-to-performance relation of our method and random selection. While more annotation generally improves performance, our method selects more diverse and important frames than random selection, resulting in a significantly better model at each step for the same cost.
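As a rough illustration of why hybrid selection can spread a fixed budget over more videos, the sketch below counts annotation cost in annotated frames. Measuring cost as a frame count is an assumption for this example, and the video names, lengths, and selections are made-up toy values.

```python
from typing import Dict, List


def sample_level_cost(video_lengths: Dict[str, int], selected: List[str]) -> int:
    """Traditional AL: every frame of each selected video is annotated."""
    return sum(video_lengths[v] for v in selected)


def hybrid_cost(selected_frames: Dict[str, List[int]]) -> int:
    """Hybrid AL: only the chosen frames of each chosen video are annotated."""
    return sum(len(set(frames)) for frames in selected_frames.values())


lengths = {"v1": 100, "v2": 150, "v3": 120}  # toy video lengths in frames
print(sample_level_cost(lengths, ["v1"]))
print(hybrid_cost({"v1": [3, 40, 77], "v2": [10, 90], "v3": [5, 60]}))
```

In this toy setting, annotating one full video costs 100 frames, whereas seven selected frames cover all three videos.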

Figure 6: (a-b) Performance of our method compared with the random selection baseline on UCF-101-24 for various sample annotation percentages. The cost of annotation for each step is shown by the shaded bars, with the cost value on the right axis in thousands. (c-d) Performance difference between increasing both sample and frame annotations [5%] and increasing only frame annotations [10%] on UCF-101-24. Increasing both samples and frames in 5% increments adds diversity compared to only increasing frames, giving better scores.


In this work, we present a novel hybrid AL strategy for reducing the annotation cost of video action detection. Our hybrid approach uses a clustering-aware strategy to select informative and diverse samples, reducing sample redundancy, while also performing intra-sample selection to reduce frame annotation redundancy. We also propose a novel STeW loss that helps the model train with limited annotations, removing the need for dense annotations for video action detection. In contrast to the traditional AL approach, our proposed hybrid approach adds more annotation diversity at the same cost. We evaluate the proposed approach on two different action detection datasets, demonstrating its effectiveness in learning from limited labels with minimal performance trade-off.


Aayush J Rana


Yogesh S Rawat