Deep Motion Prior for Weakly-Supervised Temporal Action Localization

Meng Cao, Can Zhang, Long Chen, Mike Zheng Shou, Yuexian Zou

Abstract

Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos with only video-level labels. Currently, most state-of-the-art WSTAL methods follow a Multi-Instance Learning (MIL) pipeline: producing snippet-level predictions first and then aggregating them into a video-level prediction. However, we argue that existing methods overlook two important drawbacks: 1) inadequate use of motion information and 2) the incompatibility of the prevailing cross-entropy training loss. In this paper, we analyze that the motion cues behind optical flow features carry complementary information. Inspired by this, we propose to build a context-dependent motion prior, termed motionness. Specifically, a motion graph is introduced to model motionness based on a local motion carrier (e.g., optical flow). In addition, to highlight more informative video snippets, a motion-guided loss is proposed to modulate the network training conditioned on the motionness scores. Extensive ablation studies confirm that motionness effectively models the actions of interest, and the motion-guided loss leads to more accurate results. Moreover, the motion-guided loss is plug-and-play and can be combined with existing WSTAL methods. Built on the standard MIL pipeline, our method achieves new state-of-the-art performance on three challenging benchmarks: THUMOS'14, ActivityNet v1.2, and ActivityNet v1.3.
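To make the MIL pipeline concrete, the sketch below shows a top-k aggregation step of the kind described above: snippet-level class activations are pooled into video-level scores and trained against the video-level labels with cross-entropy. This is a minimal illustration assuming PyTorch tensors; the function names, shapes, and the choice of k are our assumptions, not the authors' implementation.

```python
# Minimal sketch of top-k MIL aggregation (illustrative, not the authors' code).
import torch

def video_level_scores(tcas: torch.Tensor, k: int) -> torch.Tensor:
    """Aggregate snippet-level class activations into video-level logits.

    tcas: (T, C) temporal class activation sequence for one video.
    k:    number of top-scoring snippets averaged per class.
    """
    topk_vals, _ = tcas.topk(k, dim=0)      # (k, C): top-k snippets per class
    return topk_vals.mean(dim=0)            # (C,): per-class video-level logits

def mil_loss(tcas: torch.Tensor, labels: torch.Tensor, k: int) -> torch.Tensor:
    """Cross-entropy between aggregated predictions and video-level labels.

    labels: (C,) multi-hot video-level annotation.
    """
    probs = torch.softmax(video_level_scores(tcas, k), dim=0)
    labels = labels.float()
    labels = labels / labels.sum().clamp(min=1)   # normalize multi-hot labels
    return -(labels * torch.log(probs + 1e-8)).sum()
```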

Method

Fig. Schematic illustration of the proposed DMP-Net, which consists of two branches: (a) a base branch that produces class-specific probabilities (the temporal class activation sequence, TCAS) and (b) a guidance branch that outputs the class-agnostic deep motion prior. In the base branch, for each channel (category) of the TCAS, the top-k items with the largest values (marked as red nodes) are selected and aggregated into the video-level classification result. In the guidance branch, the corresponding items in the motionness sequence are also selected and fed to our motion-guided loss $\mathcal{L}_{g}$. For clarity, we only show the motionness selection for the first channel; the remaining channels are handled similarly.
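The sketch below gives one plausible instantiation of the two pieces the figure describes: a similarity-based motion graph over optical-flow features that yields a class-agnostic motionness sequence, and a motion-guided loss that raises the motionness of the snippets selected by the base branch. Everything here (module names, the softmax adjacency, the log-based form of $\mathcal{L}_{g}$) is an assumption for illustration; the paper defines the exact graph construction and loss.

```python
# Hedged sketch of the guidance branch and motion-guided loss (assumed shapes).
import torch
import torch.nn as nn

class MotionnessGraph(nn.Module):
    """Guidance-branch sketch: class-agnostic motionness from flow features."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)   # one graph-convolution step
        self.head = nn.Linear(dim, 1)     # per-snippet motionness score

    def forward(self, flow_feat: torch.Tensor) -> torch.Tensor:
        # flow_feat: (T, D) snippet-level optical-flow features.
        sim = flow_feat @ flow_feat.t()                        # (T, T) similarities
        adj = torch.softmax(sim / flow_feat.shape[1] ** 0.5, dim=-1)
        ctx = torch.relu(self.proj(adj @ flow_feat))           # propagate context
        return torch.sigmoid(self.head(ctx)).squeeze(-1)       # (T,) motionness

def motion_guided_loss(tcas, motionness, labels, k):
    """One reading of L_g: for each ground-truth class, encourage high
    motionness on the same top-k snippets the base branch selects."""
    topk_idx = tcas.topk(k, dim=0).indices                     # (k, C)
    loss, n = tcas.new_zeros(()), 0
    for c in labels.nonzero(as_tuple=True)[0]:                 # ground-truth classes
        selected = motionness[topk_idx[:, c]]                  # (k,) motionness items
        loss = loss - torch.log(selected + 1e-8).mean()
        n += 1
    return loss / max(n, 1)
```

In training, such a guidance term would be added to the base MIL classification loss with a weighting hyperparameter, so that motionness modulates which snippets dominate the video-level prediction.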

Results

1) Quantitative Results on THUMOS'14

2) Qualitative Results (Video Demo)