Introduction
Human action recognition is valuable for numerous practical applications, e.g., gaming, video surveillance, and video search. In this article, we hypothesize that action classification can be boosted by designing a smart feature pooling strategy under the prevalently used bag-of-words representation. Founded on automatic video saliency analysis, we propose the Spatial-Temporal Attention-aware Pooling (STAP) scheme for feature pooling. First, video saliencies are predicted using a video saliency model, and the localized spatial-temporal features are pooled at different saliency levels to form video-saliency-guided channels. Saliency-aware matching kernels are then derived as the similarity measurement over these channels. Intuitively, the proposed kernels calculate the similarities of the video foreground (salient areas) or background (non-salient areas) at different levels. Finally, the kernels are fed into support vector machines for action classification.
Figure 1. Illustration of the spatial-temporal attention-aware feature pooling for action recognition. The figure shows that our approach is superior to spatial pyramid matching due to the implicit background/foreground matching. The local features are pooled according to (b) traditional SPM pooling with 2×2×2 channels in the spatial-temporal domain and (c) the proposed saliency-aware feature pooling with video-saliency-guided channels.
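As a minimal Python sketch of the pooling and kernel computation described above, assuming quantized local features and per-feature saliency values; the saliency bin edges and the histogram-intersection base kernel are illustrative assumptions, not the exact formulation in the paper:

import numpy as np

def saliency_pooled_histograms(codes, saliency, num_words, edges=(0.0, 0.5, 1.0)):
    # codes: (N,) visual-word index of each local feature
    # saliency: (N,) saliency value in [0, 1] at each feature location
    # edges: bin edges splitting features into saliency-guided channels
    channels = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (saliency >= lo) & ((saliency < hi) | (hi >= 1.0))
        hist = np.bincount(codes[mask], minlength=num_words).astype(float)
        channels.append(hist / max(hist.sum(), 1.0))  # L1-normalize each channel
    return channels  # e.g. [background channel, foreground channel]

def saliency_aware_kernel(channels_a, channels_b):
    # Sum of per-channel histogram-intersection similarities.
    return sum(np.minimum(a, b).sum() for a, b in zip(channels_a, channels_b))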
Paper [Download]
Dense sampling executable [Download]
We use a dense sampling method that extracts HOG, HOF, and MBH features from the input video.
The executable has been tested on Windows 7.
Usage:
DenseSampling [Input Video Filename] > [Output Descriptor Filename]
Example:
DenseSampling .\samples\Diving_Side_001.vob > .\descriptors\Diving_Side_001.vob.txt
Arguments:
- S10: Sampling time = 10 (default: 10)
- L15: Temporal length = 15 (default: 15)
- P32: Patch size = 32 (default: 32)
The dense sampling features are computed one by one, each written on a single line in the following format (a parsing sketch in Python follows the field list below):
frameNumber x y length scale HOG HOF MBHx MBHy
The first 5 elements describe the sampling:
frameNumber: The frame on which the sampling ends
x: The x coordinate of the sampling point
y: The y coordinate of the sampling point
length: The temporal length
scale: The scale at which the descriptor is computed
The remaining elements are the four descriptors, concatenated in order:
HOG: 96 dimensions
HOF: 108 dimensions
MBHx: 96 dimensions
MBHy: 96 dimensions
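To make the layout concrete, here is a small parsing sketch for one output line, assuming the field order and dimensions listed above:

def parse_descriptor_line(line):
    # 5 header fields followed by 96 + 108 + 96 + 96 descriptor values.
    values = line.split()
    frame_number, x, y, length, scale = (float(v) for v in values[:5])
    desc = [float(v) for v in values[5:]]
    assert len(desc) == 96 + 108 + 96 + 96, "unexpected descriptor length"
    hog, hof = desc[0:96], desc[96:204]
    mbhx, mbhy = desc[204:300], desc[300:396]
    return frame_number, x, y, length, scale, hog, hof, mbhx, mbhy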
Convert raw dense sampling descriptors to BOW using VLFEAT [Download]
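The linked code uses VLFEAT (MATLAB); purely as an illustration, an equivalent quantization step in Python with scikit-learn (a substitute library, and the vocabulary size of 4000 is an assumption, not necessarily the paper's setting) would look roughly like this:

import numpy as np
from sklearn.cluster import KMeans

def build_bow(descriptors_per_video, num_words=4000):
    # Cluster all raw descriptors into a codebook, then histogram each video.
    codebook = KMeans(n_clusters=num_words, n_init=1).fit(np.vstack(descriptors_per_video))
    bows = []
    for desc in descriptors_per_video:
        hist = np.bincount(codebook.predict(desc), minlength=num_words).astype(float)
        bows.append(hist / max(hist.sum(), 1.0))  # L1-normalize each video histogram
    return np.array(bows), codebook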
Saliency extraction
Itti and Koch’s saliency prediction model (3 maps) [Download]
AIM [Download]
ICL [Download]
SIM [Download]
FT [Download]
LSK [Download]
SR [Download]
GBVS [Download]
Signature-LAB [Download]
SUN [Download]
Cerf et al. [Download]
Human central bias map [Download]
Motion map [use the OpenCV function calcOpticalFlowFarneback; see the sketch below]
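A minimal OpenCV (Python) sketch for the motion map; the Farnebäck parameters shown are common defaults, and taking the flow magnitude as the motion map is our reading of this step:

import cv2
import numpy as np

def motion_map(prev_frame, next_frame):
    # Dense Farneback optical flow between two consecutive frames.
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)  # flow has shape (H, W, 2): dx, dy
    return magnitude / max(magnitude.max(), 1e-6)  # normalize to [0, 1]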
Kernel SVM with MATLAB code [Download]
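The released classifier code is MATLAB; for illustration only, the same precomputed-kernel setup in Python with scikit-learn (a substitute, not the released code, with histogram intersection as the example kernel) is:

import numpy as np
from sklearn.svm import SVC

def train_kernel_svm(features, labels, C=1.0):
    # features: list of pooled BOW vectors; Gram matrix via histogram intersection.
    n = len(features)
    gram = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            gram[i, j] = gram[j, i] = np.minimum(features[i], features[j]).sum()
    clf = SVC(kernel='precomputed', C=C)
    clf.fit(gram, labels)
    return clf

At test time, pass the kernel matrix between test and training samples (shape num_test x num_train) to clf.predict.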
If you use our work, please cite the following paper:
Tam V. Nguyen, Zheng Song, Shuicheng Yan. “STAP: Spatial-Temporal Attention-aware Pooling for Action Recognition”, IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), 2014.
Contact or password request:
Please drop an email to: vantam@gmail.com